Multi-task learning based feature extraction method in signal processing of high resolution remote sensing video images
Published Online: Mar 21, 2025
Received: Oct 28, 2024
Accepted: Feb 06, 2025
DOI: https://doi.org/10.2478/amns-2025-0695
© 2025 Xinming Fan, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Remote sensing (RS) refers to detection technology that perceives targets or natural phenomena over long distances without direct contact. In the narrow sense, it refers to the use of various sensors (e.g., photogrammetric cameras, scanners, and radars) mounted on high-altitude and space platforms to obtain information on the Earth's surface and, by means of data transmission and processing, to study the shapes, sizes, locations, and properties of ground objects as well as their interrelationships with the environment; it is a modern technical science [1-4]. Remote sensing is usually categorized into spaceborne remote sensing and airborne remote sensing according to the platform carrying the sensors. Video and images are the information most commonly available from remote sensing systems. In order to process and analyze this information to explore the parts of interest, well-designed image signal processing methods are a necessary technical guarantee [5-8].
With the rapid development of modern science and technology, geographic information systems (GIS) play an increasingly important role in the national economy, while remote sensing has its own unique advantages in information acquisition; the combination of the two is bound to have broad prospects in both military and civilian fields [9-12]. Over the past decades, many feature extraction methods have been proposed, the most classic being principal component analysis (PCA) and linear discriminant analysis (LDA). However, in practice the number of training samples is sometimes very small, in which case most traditional pattern recognition methods are not very effective; it has been found that multi-task learning methods can help solve this problem [13-17]. This paper introduces the idea of multi-task learning into feature extraction, using relevant and useful information from other databases to help extract better features, thus improving the efficiency of signal processing for high-resolution remote sensing video images [18-22].
This article introduces the principles of multi-task learning and applies the multi-task learning mechanism to feature extraction in deep learning models: a multi-class feature extraction and recognition problem is transformed into multiple binary feature extraction and recognition problems, an edge extraction task is additionally introduced, and the loss function of edge extraction is optimized to improve overall accuracy in edge regions, establishing a multi-task, multi-decoder triple attention model. Taking the Vaihingen, WHDLD, and DLRSD datasets as experimental objects, multiple semantic segmentation models are selected as comparison methods; the fitting effect of the feature extraction methods is explored through the loss curves of model training, and the accuracy, memory usage, and frame rate of the different models are compared. On the three datasets, different feature extraction methods are then applied to extract multiple categories of objects, exploring the extraction accuracy of the MD TANet model for specific ground features and verifying the generalization ability of the model.
Inspired by the fact that humans often transfer experience from similar tasks when learning new knowledge, multi-task learning is a machine learning method that obtains additional information from multiple related tasks to influence the outcome of the main task. It aims to improve prediction efficiency and accuracy through inductive transfer and shared representations, and has been widely used in natural language processing, machine vision, and speech recognition.
According to how hidden-layer parameters are shared during CNN training, multi-task learning is categorized into soft parameter sharing and hard parameter sharing; the two sharing methods are shown in Fig. 1. Soft parameter sharing builds independent models with similar structure for different tasks: the parameters of the different models are independent of each other, and parameter constraints and task associations are realized through regularization. When task similarity is low, soft parameter sharing can instead improve the diversity of the extracted features. Hard parameter sharing shares the underlying parameters among different tasks and then designs different model outputs to meet the requirements of each task. Compared with the looser constraints of soft parameter sharing, hard parameter sharing usually lets different tasks use some of the same parameters directly, so it requires a higher degree of task correlation or similarity, and the underlying features of each task should be similar. Common hard-parameter-sharing multi-task neural networks are mostly built on an encoder-decoder structure: the models share an encoder and design a corresponding decoder for each task. The multi-decoder triple attention model designed in this paper adopts this common encoder-decoder structure, performing feature extraction and decoding respectively.

Figure 1. The parameter sharing methods of multi-task learning
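To make the hard-parameter-sharing structure concrete, the following is a minimal PyTorch sketch, not the paper's architecture: the layer sizes and number of task heads are illustrative assumptions. A single encoder receives gradients from every task, while each task keeps its own head.

```python
import torch
import torch.nn as nn

class HardSharingNet(nn.Module):
    """Minimal hard parameter sharing: one shared encoder, one head per task."""
    def __init__(self, in_channels=3, num_tasks=2):
        super().__init__()
        # Shared bottom layers: all tasks backpropagate into these weights.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Task-specific heads: these parameters are not shared across tasks.
        self.heads = nn.ModuleList(
            [nn.Conv2d(64, 1, 1) for _ in range(num_tasks)]
        )

    def forward(self, x):
        shared = self.encoder(x)                       # common representation
        return [head(shared) for head in self.heads]   # one output per task

outputs = HardSharingNet()(torch.randn(2, 3, 256, 256))
```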
The biggest advantage of multi-task learning is that, compared with a single task, it helps reduce the generalization error of the model and improves overall performance. The reason is that, when multiple tasks are trained simultaneously, the model's representation is implicitly regularized and the effective sample size of the data is implicitly enlarged relative to a single task, both of which reduce the risk of overfitting the original model. Because the tasks jointly learn to suppress the noise specific to each task, joint training helps improve overall accuracy. A single task is prone to falling into a local optimum, whereas multiple related but different tasks tend to have different locally optimal solutions; when the gradients of all tasks are backpropagated at the same time, joint learning makes it more likely that the gradient approaches the globally optimal solution.
Current mainstream multi-task deep learning models are dominated by the hard parameter sharing method, and PAD-Net, a hard-parameter-sharing multi-task model, further optimizes the multi-task model on the basis of the attention mechanism. In view of this, this paper combines the attention mechanism to construct a multi-task learning model. For high-resolution remote sensing video image signal processing, the triple attention model effectively alleviates the problem that insufficient global features make it difficult to extract large-scale targets, but misjudgment of similar ground features remains; the multi-task learning mechanism is therefore used to improve the extraction accuracy of remote sensing image features in the deep learning model.
The attention mechanism (AM) is a technique that has gained widespread use in deep learning, especially in natural language processing (NLP) and computer vision (CV). It draws on the principles of human visual attention, i.e., the ability to focus on the key parts that matter most for the task at hand when confronted with a large amount of information. When viewing a new image, people's eyes tend to be drawn to areas of interest or obvious color contrast, and while extracting this key information they temporarily ignore other details of the image. Introducing the attention mechanism into deep learning models helps them identify the most informative parts of the data being processed. Attention mechanisms can be divided into channel attention mechanisms and spatial attention mechanisms.
The core idea of the channel attention mechanism is that not all channels are equally important in a given feature response (i.e., the feature map or feature representation generated by a convolutional layer).
A pooling operation, typically global average pooling and global max pooling, is first performed on each channel to obtain that channel's global statistical features. This operation reduces each 2D feature map to a single scalar, which provides the basis for subsequently learning the channel weights. Next, a small multilayer perceptron (typically containing one or two hidden layers) learns the weight of each channel from the pooled features; the number of hidden layers and their dimensions depend on the specific application. Finally, a Sigmoid activation function normalizes these weights so that each lies between 0 and 1. The weights are then used to reweight the original feature channels, improving the network's sensitivity to important features. Based on the above description, channel attention can be computed as:
$$M_c(F)=\sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F))+\mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$

where $F$ is the input feature map, $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ denote the global pooling operations, $\mathrm{MLP}$ is the shared multilayer perceptron, and $\sigma$ is the Sigmoid function.
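A minimal PyTorch sketch of this computation follows; the reduction ratio of 16 and the single hidden layer are common illustrative choices, not values specified by the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP with one hidden layer, applied to both pooled vectors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))           # global average pooling -> (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))            # global max pooling -> (B, C)
        w = self.sigmoid(avg + mx).view(b, c, 1, 1)  # per-channel weights in (0, 1)
        return x * w                                 # reweight the original channels
```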
The spatial attention mechanism is an algorithm used to enhance the attention of a neural network to features at specific spatial locations in the input data. In visual tasks, this mechanism recognizes that pixels at different locations in an image contain varying degrees of informativeness, and that certain regions may be more critical to the completion of a particular task. The introduction of spatial attention allows the model to dynamically adjust the focus of its processing, similar to how the human visual system prioritizes those parts of a scene that are meaningful when observing it. In standard CNNs, the convolutional operations at each layer give equal importance to each spatial location of the input feature map. However, in numerous visual tasks, different spatial locations contribute differently to the final output decision. Spatial attention mechanisms are proposed to enable the model to automatically identify and focus on the most useful regions, thereby improving performance. A typical spatial attention module usually takes the following steps:
First, global average pooling and max pooling are performed along the channel dimension, and the two resulting maps are concatenated to obtain a feature map of dimension 2×H×W. A convolutional layer is then applied to this concatenated map to produce a single-channel response, and a Sigmoid activation normalizes it into a 1×H×W spatial attention map whose values lie between 0 and 1; this map is used to weight each spatial location of the original features.
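A matching sketch of such a spatial attention module is shown below; the 7×7 convolution kernel is a common choice, assumed here rather than taken from the paper.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # 2 input channels: channel-wise average map and max map, concatenated.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)          # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)           # (B, 1, H, W)
        attn = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)
        return x * attn                            # emphasize informative locations
```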
Aiming at the multi-category feature extraction problem in high-resolution remote sensing video images, this paper constructs a multi-task-based multi-decoder triple attention model (MD TANet), which transforms a multi-class semantic segmentation problem into multiple binary semantic segmentation problems. A separate decoder, composed of multiple attention modules, is constructed for each category; each decoder attends only to its corresponding category, without considering the features of other categories, thus reducing inter-category competition.
In order to strengthen the connections within the features, two self-attention modules from DANet are introduced.
In the positional attention module, connections are established among the input local features along the positional dimension. The features are first transformed into the forms Q, K, and V by convolution; to model positional relationships, Q, K, and V are reshaped so that the matrix product of Q and K yields an (H×W)×(H×W) attention weight map, where H and W denote the height and width of the feature map. After Softmax normalization, this map weights V, and the result is scaled and summed with the original input to produce the output.
The channel attention module swaps the order of the dot product used in the position attention module, so the generated attention weight map has shape C×C, where C is the number of channels; it models the interdependencies between channels in the same way.
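A minimal sketch of such a position attention module in PyTorch follows; the C/8 reduction for Q and K is a common DANet choice, assumed here. The channel attention variant simply transposes the flattened features so that the affinity matrix is C×C instead.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """DANet-style position attention: an (HW x HW) affinity over spatial locations."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable residual scale

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)    # (B, HW, C/8)
        k = self.k(x).flatten(2)                    # (B, C/8, HW)
        attn = torch.softmax(q @ k, dim=-1)         # (B, HW, HW) position affinities
        v = self.v(x).flatten(2)                    # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                 # residual connection
```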
Combined with the attention mechanism, the structure of the label attention module is shown in Fig. 2. Because the number of channels of the attention probability map increases, the map is no longer obtained by the dot product of Q and K but by a direct convolution of the input features. Similarly, the attention output is convolved with the original input and then summed to obtain the final output. Equations (7) and (8) describe the overall structure of the label attention module.

Figure 2. The structure of the label attention module
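As one possible reading of this description, a hedged sketch is given below; the convolution sizes are assumptions, and the paper's exact Eqs. (7) and (8) are not reproduced here. The key points it illustrates are that the attention map comes from a convolution of the input rather than a Q-K product, and that it is also returned so an auxiliary attention loss can supervise it against the label map.

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    """Sketch of the label attention idea (illustrative, not the paper's exact module)."""
    def __init__(self, channels, num_maps=1):
        super().__init__()
        self.attn_conv = nn.Conv2d(channels, num_maps, 1)       # probability map(s)
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                          # x: (B, C, H, W)
        attn = torch.sigmoid(self.attn_conv(x))    # (B, num_maps, H, W)
        out = self.out_conv(x * attn) + x          # weight, convolve, residual sum
        return out, attn                           # attn also feeds the attention loss
```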
Because an additional loss function is introduced in the label attention module, the way parameters are updated during backpropagation changes: in the label attention module, the parameters are constrained by two loss functions at the same time, the segmentation loss function and the attention loss function, which jointly optimize the parameters, as shown in Eqs. (9) and (10).
The model employs an encoder-decoder architecture; the encoder is essentially the same VGG-style architecture adopted by UNet, and each decoder contains three attention modules: two self-attention modules and one label attention module, as shown in Equations (11) and (12).
The dataset contains a total of $n$ categories, and a separate decoder is constructed for each category.
In the label attention module, since each decoder corresponds to only one category, the number of attention channels of label attention changes accordingly, from the original multi-channel map to a single-channel probability map. After each decoder produces the predicted probability of its corresponding category, the probability maps of all categories are combined directly, and at each pixel the category with the highest probability is selected as the final prediction. Since the label attention model has already hinted at the feature extraction region in the attention module, which is equivalent to a layer of weak constraints, no further fusion of the overall results is needed, avoiding the loss of feature information. As for the loss function, because the task is split, the loss function of each category is calculated separately, as shown in Eqs. (13), (14) and (15).
The final overall loss function is the sum of all the attention loss functions and the semantic segmentation loss function.
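A sketch of how the per-decoder losses and probability maps might be combined under these definitions is shown below; the binary cross-entropy losses and unit weighting are assumptions, and the paper's exact Eqs. (13)-(15) are not reproduced.

```python
import torch
import torch.nn.functional as F

def multi_decoder_loss(cat_logits, cat_attns, labels, lam=1.0):
    """Sum one binary segmentation loss per category-specific decoder,
    plus the attention losses (weighting lam is an assumption)."""
    total = 0.0
    for c, (logit, attn) in enumerate(zip(cat_logits, cat_attns)):
        target = (labels == c).float().unsqueeze(1)          # binary mask of class c
        total = total + F.binary_cross_entropy_with_logits(logit, target)
        total = total + lam * F.binary_cross_entropy(attn, target)
    return total

def fuse_predictions(cat_logits):
    # Stack per-category probability maps; pick the most probable class per pixel.
    probs = torch.sigmoid(torch.cat(cat_logits, dim=1))      # (B, num_classes, H, W)
    return probs.argmax(dim=1)                               # (B, H, W) label map
```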
In order to increase attention to edge regions, the boundary information between different ground features can easily be obtained from the labels in the training samples. The structure of the label attention is therefore reused, with the label information replaced by boundary information, so that the model's attention is focused on the boundaries between different feature types.
The loss function part of the model adds a distance transform loss function (for the edge extraction task) and a root mean square error loss function (for the distance map extraction task) on top of MD TANet.
Computing the distance transform loss function first requires calculating the distance map of the ground-truth boundaries, in which each pixel stores its distance to the nearest boundary point.
The distance transform loss function is the sum of the distances at the points where the prediction map differs from the ground-truth map, as shown in equations (16) and (17):
$$L_{dt}=\sum_{p:\,\hat{y}(p)\neq y(p)} D(p)$$

wherein $\hat{y}(p)$ denotes the predicted value at pixel $p$, $y(p)$ the ground-truth value, and $D(p)$ the value of the distance map at $p$.
In the distance map extraction branch, since the values in the label's distance map are continuous, the common root mean square error loss function can be used to measure the distance between the prediction results and the labels, as shown in equation (18). Combining all the above, the final loss function consists of four parts: the attention loss function, the semantic segmentation loss function, the distance transform loss function, and the root mean square error loss function.
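A sketch of the distance map and the distance transform loss under the definitions above is given next; the use of SciPy's Euclidean distance transform is an implementation assumption.

```python
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def distance_map(edge_mask):
    """Distance map of the ground-truth edges: each pixel stores its Euclidean
    distance to the nearest boundary pixel (0 on the boundary itself)."""
    return torch.from_numpy(
        distance_transform_edt(1 - edge_mask.numpy()).astype(np.float32)
    )

def distance_transform_loss(pred_edges, gt_edges, dist_map):
    # Sum the distance values only at pixels where prediction and label disagree,
    # so mistakes far from any true boundary are penalized more heavily.
    wrong = (pred_edges != gt_edges).float()
    return (wrong * dist_map).sum()
```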
Overall, in order to improve feature extraction accuracy in edge regions, this paper additionally introduces the edge extraction task, which involves the following three optimizations.
(1) Add an edge attention module, increasing each decoder from three attention modules to four, so that the model pays more attention to extraction accuracy in the edge region.
(2) Add an edge extraction branch, which uses the distance transform loss function to sum the distance values of misjudged points and optimize the model specifically on the mispredicted points.
(3) Add a distance map generation branch, which transforms the edge detection task into a distance map generation task and directly adopts the root mean square error loss function, thereby improving extraction accuracy in the edge region.
In order to validate the effectiveness of the proposed multi-task learning based feature extraction method, test and comparison experiments are conducted on the Vaihingen, WHDLD, and DLRSD datasets in this chapter. This section describes the experiments in detail.
The Vaihingen dataset is a remote sensing image dataset for semantic segmentation tasks, including orthorectified remote sensing images and corresponding digital surface models. Each image is manually labeled with six ground object categories: impervious surfaces, buildings, low vegetation, trees, cars, and others.
The Wuhan Dense Labeling Dataset (WHDLD) is a remote sensing image dataset for semantic segmentation tasks, obtained by cropping a very large remote sensing image of Wuhan. The dataset is manually labeled with six ground object categories: buildings, roads, pavement, vegetation, bare soil, and water. It contains a total of 4940 RGB images with a ground sampling distance of 2 m and their corresponding label data; all images are 256×256.
The Densely Labeled Remote Sensing Dataset (DLRSD) is another remote sensing image dataset for semantic segmentation tasks. It contains 17 label categories: airplane, bare soil, buildings, cars, chaparral, court, dock, field, grass, mobile home, pavement, sand, sea, ship, tanks, trees, and water. The dataset contains a total of 2100 RGB images with a ground sampling distance of 0.3 m and their corresponding label data; all images are 256×256.
In order to verify the performance of the multi-task network model, the experiments are carried out on the servers of a high-performance computing platform; the configuration of the experimental platform is shown in Table 1.
Table 1. Experimental platform configuration
| | Name | Configuration |
|---|---|---|
| Hardware | CPU | Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz |
| | Memory | 128 GB |
| | GPU | Tesla P100 |
| | GPU memory | 16 GB |
| Software | Programming language | Python 3.7.15 |
| | Development environment | PyCharm |
| | Anaconda version | Conda 4.10.3 |
| | Deep learning framework | PyTorch 1.9.0 |
| | Operating system | CentOS 7.7 |
When evaluating the performance of deep learning models, the metrics generally fall into three aspects: accuracy, memory footprint, and frame rate.
AP and mAP. The average precision (AP) is associated with two metrics: precision and recall. True positives (TP) are samples with positive labels that are predicted as positive, and true negatives (TN) are negative samples that are also predicted as negative; the larger the proportion of TP and TN, the better. False positives (FP) are false detections, i.e., predicted positives that are actually negative samples, and false negatives (FN) are missed detections, i.e., predicted negatives that are actually positive samples. Precision is the proportion of true positives among all samples predicted as positive, as shown in Equation (20), and reflects the reliability of the model's detections:

$$\mathrm{Precision}=\frac{TP}{TP+FP} \quad (20)$$
Recall is the proportion of true positives among all actually positive samples, as shown in Equation (21), and reflects the model's completeness of detection:

$$\mathrm{Recall}=\frac{TP}{TP+FN} \quad (21)$$
For model performance, the higher the Precision and Recall values the better, but the two are often in tension with each other. To better evaluate an algorithm, the PR curve is drawn with Recall on the horizontal axis and Precision on the vertical axis, integrating the results of Precision and Recall. The mean precision over the curve is the average precision (AP), which is commonly used to evaluate the detection of a single category; the area under the PR curve is given by Equation (22):

$$AP=\int_0^1 P(R)\,dR \quad (22)$$
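A minimal numerical sketch of this integration is given below; the trapezoidal rule is an assumption here, and detection benchmarks often use interpolated variants instead.

```python
import numpy as np

def average_precision(precision, recall):
    """AP as the area under the PR curve, integrated over recall (Eq. (22))."""
    order = np.argsort(recall)                       # sort points by recall
    return float(np.trapz(np.asarray(precision)[order],
                          np.asarray(recall)[order]))
```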
Parameters and parameter size. Parameters refers to the number of parameters included in the model, i.e., the number of values the model must learn during training. Taking a convolution used in the model as an example, suppose the number of channels of the input feature map is $C_{in}$, the number of output channels is $C_{out}$, and the kernel size is $k\times k$; then the convolution contains $C_{in}\times k\times k\times C_{out}$ weights (plus $C_{out}$ bias terms). The number of parameters directly determines the model size: since models typically have very many parameters, the count is usually expressed in millions (M), and the corresponding model size on disk is measured in MB.
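A small sketch of counting parameters and estimating model size in PyTorch, assuming float32 storage (4 bytes per parameter):

```python
import torch.nn as nn

def parameter_count(model: nn.Module):
    """Number of learnable parameters, and the model size in MB (float32 assumed)."""
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n, n * 4 / 1024 ** 2

conv = nn.Conv2d(64, 128, kernel_size=3)   # 64*3*3*128 weights + 128 biases
print(parameter_count(conv))               # (73856, ~0.28 MB)
```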
FLOPs and FPS. FLOPs denotes the number of floating-point operations a model performs and measures its computational complexity; it is often used as an indirect measure of model speed (distinct from FLOPS, floating-point operations per second, which measures hardware throughput). The frame rate FPS indicates the number of frames processed per second, i.e., how many images can be processed per second or the time required to process one image, and evaluates the execution speed of the model: the shorter the time, the faster the model. MIoU. By counting the ratio of the intersection and union of the true and predicted regions, IoU measures the degree of overlap between the prediction results and the true labels:

$$IoU=\frac{TP}{TP+FP+FN}$$
In feature extraction, IoU denotes the ratio between the number of pixels in the intersection of the predicted target region and the corresponding labeled target region, and the number of pixels in the union of the two.
The mean intersection-over-union, mIoU, is the average of the IoU values over all categories: $\mathrm{mIoU}=\frac{1}{k}\sum_{i=1}^{k} IoU_i$, where $k$ is the number of categories.
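As an illustrative NumPy sketch, mIoU can be computed from a confusion matrix, where the diagonal gives per-class intersections and row/column sums give the unions:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU: per-class intersection-over-union, averaged over classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)       # confusion matrix
    inter = np.diag(cm).astype(np.float64)
    union = cm.sum(0) + cm.sum(1) - np.diag(cm)        # TP + FP + FN per class
    iou = inter / np.maximum(union, 1)                 # guard against empty classes
    return iou.mean(), iou
```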
In order to investigate the impact of multi-task learning, comparative experiments are conducted on the Vaihingen dataset against different feature extraction methods: U-Net, EANet, PSPNet, BiseNetv2, CCNet, RefineNet, EMANet, and Deeplabv3+.
The above models and the MD TANet model of this paper are trained under the experimental environment described in Section 4.2.1. During training, a multi-threaded approach is used to read data in order to speed up data loading. The batch size is set to 64 and the initial learning rate to 0.01; all models use the cosine strategy to adjust the learning rate, the SGD optimizer to update parameters, and 250 training epochs.
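This training setup can be summarized in a PyTorch sketch; here `model`, `train_set`, and `compute_loss` are assumed placeholders, and the momentum value and number of worker threads are assumptions not given in the paper.

```python
import torch
from torch.utils.data import DataLoader

# Hyperparameters from this section: batch size 64, initial lr 0.01,
# cosine learning-rate schedule, SGD optimizer, 250 epochs.
loader = DataLoader(train_set, batch_size=64, shuffle=True,
                    num_workers=8, pin_memory=True)    # multi-threaded data reading
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=250)

for epoch in range(250):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = compute_loss(model(images), labels)     # overall multi-task loss
        loss.backward()
        optimizer.step()
    scheduler.step()                                   # cosine learning-rate decay
```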
The difference between the model's predictions and the label ground truth is evaluated by the loss value: the smaller the loss, the better the model fits the ground truth. After training for 250 epochs, the loss curves during training are shown in Fig. 3; the loss values of all models converge, and the MD TANet model converges to the lowest loss value of 0.17, giving the best fit.

The loss value of different model training
In order to better evaluate network performance, five metrics are chosen to assess the detection and segmentation results: mAP, mIoU, FPS, Parameters, and FLOPs. With 0.5 used as the confidence threshold, the corresponding evaluation criteria are written mAP@50 and mIoU@50.
The evaluation results of the different models on the validation set are shown in Table 2. The experiments show that the MD TANet model has the best recognition accuracy on the Vaihingen validation set, with mAP@50 and mIoU@50 reaching 78.10% and 75.98%, respectively; the former is 1.16%~5.96% higher than the comparison methods and the latter 2.43%~6.38% higher. In terms of speed, the multi-task model MD TANet achieves high real-time performance of 90.72 FPS with only a small increase in the number of parameters. The comparative analysis of these results shows that the improved multi-task model MD TANet achieves a better balance between accuracy and speed and better performance in remote sensing video image feature extraction.
Table 2. Evaluation results of different models on the validation set
| Model | Input size | mAP@50 | mIoU@50 | FPS | Parameters (M) | FLOPs |
|---|---|---|---|---|---|---|
| U-Net | 256×256 | 73.86 | 70.62 | 82.42 | 10.57 | 25.23 |
| EANet | 256×256 | 75.88 | 72.31 | 94.22 | 7.93 | 28.13 |
| PSPNet | 256×256 | 72.14 | 69.60 | 73.35 | 11.49 | 22.12 |
| BiseNetv2 | 256×256 | 74.81 | 72.82 | 68.32 | 13.31 | 33.49 |
| CCNet | 256×256 | 76.94 | 73.55 | 76.22 | 10.77 | 53.35 |
| RefineNet | 256×256 | 76.42 | 72.48 | 88.65 | 9.42 | 40.44 |
| EMANet | 256×256 | 75.78 | 70.69 | 73.76 | 12.07 | 29.65 |
| Deeplabv3+ | 256×256 | 76.49 | 73.54 | 76.21 | 11.25 | 35.37 |
| MD TANet | 256×256 | 78.10 | 75.98 | 90.72 | 10.38 | 26.54 |
In order to further validate the effectiveness of the proposed multi-task model MD TANet, feature extraction experiments for specific ground features are conducted on the Vaihingen, WHDLD, and DLRSD datasets; for the DLRSD dataset, the comparison is restricted to the algorithms that performed better on the other datasets.
Comparison experiment results on the Vaihingen dataset. Using the Vaihingen dataset, the performance of this paper's method is compared with the comparison algorithms. The test results are shown in Fig. 4, and Table 3 lists the results of the different models. The mIoU of this paper's algorithm on the Vaihingen dataset reaches 75.98%; except for the car category (68.89%), whose extraction accuracy is slightly lower than that of RefineNet (69.02%) and CCNet (70.47%), all other categories and the overall accuracy achieve some improvement. Among the comparison methods, PSPNet has the lowest accuracy, 69.60%, mainly due to differences in the accuracy of the other categories, while Deeplabv3+ has a relatively high accuracy of 73.54%. Since the method in this chapter optimizes the edges of category objects, a consistent optimization across all categories should bring improvements in most of them, which the experimental results support. For the MD TANet method, accuracy on impervious surfaces and buildings is better, exceeding 80%. The lowest accuracy among the five main categories is for cars, at only 68.89%: cars have relatively low segmentation accuracy because the objects are significantly smaller and their features are not easily captured.

Comparison experiment results on the WHDLD dataset. The test results for the WHDLD dataset are shown in Fig. 5, and Table 4 lists the results of the different models. The mIoU of the MD TANet model on the WHDLD dataset improves on every comparison algorithm, reaching 67.55%, and the MD TANet method leads the other algorithms in the per-category IoU metrics. The multi-task learning based feature extraction method in this paper achieves an improvement of 3.95% to 6.61% over the comparison methods. For this paper's method, accuracy is significantly better in the vegetation and water categories, exceeding 82%; as water is easier to distinguish, it even exceeds 90%. Accuracy in the bare soil category is relatively low, with an IoU of only 48.96%.

Comparison experiment results on the DLRSD dataset. Using the DLRSD dataset, the performance of this paper's method is compared with some cutting-edge feature extraction algorithms. The test results are shown in Fig. 6, and Table 5 lists the results of the different models. The MD TANet model remains superior to the comparison methods on most categories of the DLRSD dataset, achieving an mIoU of 72.20%. Despite the large number of categories in this dataset, the MD TANet model achieves the highest feature extraction results on more than half of them; in particular, the chaparral (59.33%), court (84.99%), and ship (77.07%) categories achieve significant improvements over the other methods.

Test results on the Vaihingen data set
Table 3. Test results of different models on the Vaihingen dataset
| Models | Impervious surfaces | Buildings | Low vegetation | Trees | Cars | Other | Average (%) |
|---|---|---|---|---|---|---|---|
| MD TANet | 85.92 | 87.81 | 75.71 | 78.34 | 68.89 | 59.22 | 75.98 |
| U-Net | 80.25 | 80.34 | 67.55 | 72.26 | 65.75 | 57.55 | 70.62 |
| EANet | 81.49 | 83.84 | 71.53 | 74.96 | 66.86 | 55.16 | 72.31 |
| PSPNet | 80.56 | 80.15 | 67.71 | 69.71 | 67.62 | 51.85 | 69.60 |
| BiseNetv2 | 82.91 | 82.51 | 73.74 | 73.83 | 68.51 | 55.41 | 72.82 |
| CCNet | 82.42 | 83.57 | 73.15 | 73.84 | 70.47 | 57.82 | 73.55 |
| RefineNet | 83.14 | 82.21 | 71.87 | 70.28 | 69.02 | 58.37 | 72.48 |
| EMANet | 81.51 | 81.84 | 69.88 | 69.37 | 66.18 | 55.33 | 70.69 |
| Deeplabv3+ | 83.93 | 83.54 | 73.57 | 74.21 | 69.52 | 56.48 | 73.54 |

Test results on the WHDLD data set
Table 4. Test results of different models on the WHDLD dataset
| Models | Buildings | Roads | Pavement | Vegetation | Bare soil | Water | Average (%) |
|---|---|---|---|---|---|---|---|
| MD TANet | 62.61 | 68.74 | 49.54 | 82.01 | 48.96 | 93.44 | 67.55 |
| U-Net | 60.41 | 59.42 | 42.01 | 78.71 | 36.97 | 90.79 | 61.39 |
| EANet | 53.11 | 63.63 | 47.76 | 79.21 | 35.15 | 90.76 | 61.60 |
| PSPNet | 60.12 | 62.75 | 46.89 | 73.29 | 32.99 | 92.86 | 61.54 |
| BiseNetv2 | 58.35 | 64.14 | 48.18 | 71.91 | 32.57 | 91.09 | 61.04 |
| CCNet | 61.51 | 64.25 | 45.86 | 77.95 | 39.03 | 93.02 | 63.60 |
| RefineNet | 61.14 | 63.02 | 42.01 | 79.39 | 37.72 | 93.29 | 62.76 |
| EMANet | 59.69 | 62.73 | 47.87 | 73.11 | 37.36 | 92.18 | 62.16 |
| Deeplabv3+ | 61.57 | 62.93 | 46.98 | 79.62 | 35.05 | 92.19 | 63.06 |

Test results on the DLRSD data set
Table 5. Test results of different models on the DLRSD dataset
| Models | MD TANet | EANet | CCNet | RefineNet | Deeplabv3+ |
|---|---|---|---|---|---|
| Airplane | 68.30 | 64.65 | 62.58 | 62.89 | 64.76 |
| Bare soil | 59.05 | 53.16 | 54.51 | 53.89 | 53.49 |
| Buildings | 71.44 | 74.51 | 73.38 | 72.40 | 72.39 |
| Cars | 67.53 | 62.20 | 65.23 | 64.38 | 65.20 |
| Chaparral | 59.33 | 53.02 | 51.38 | 53.11 | 53.55 |
| Court | 84.99 | 81.28 | 80.53 | 81.72 | 80.22 |
| Dock | 54.60 | 42.82 | 46.30 | 46.88 | 45.73 |
| Field | 91.83 | 93.60 | 90.93 | 94.66 | 93.25 |
| Grass | 64.28 | 66.18 | 65.11 | 64.78 | 64.88 |
| Mobile home | 61.07 | 63.96 | 62.10 | 61.06 | 60.84 |
| Pavement | 77.61 | 76.94 | 73.11 | 75.71 | 76.83 |
| Sand | 65.71 | 71.78 | 62.75 | 68.89 | 68.18 |
| Sea | 96.73 | 93.25 | 90.79 | 92.56 | 91.86 |
| Ship | 77.07 | 65.04 | 70.23 | 72.28 | 70.15 |
| Tanks | 71.69 | 76.44 | 75.07 | 73.61 | 68.31 |
| Trees | 74.76 | 72.60 | 73.35 | 69.93 | 71.01 |
| Water | 83.44 | 82.18 | 81.75 | 80.98 | 80.04 |
| Average (%) | 72.20 | 70.21 | 69.36 | 69.98 | 69.45 |
In this study, a multi-task learning feature extraction method, MD TANet, is developed specifically for the problem of multi-category feature extraction in high-resolution remote sensing video images, effectively improving feature extraction accuracy. Through the introduced multi-task learning framework, a multi-class semantic segmentation problem is transformed into multiple binary semantic segmentation problems, which avoids competition for parameters between different classes and reduces misclassification among similar features.
The experimental results confirm that MD TANet, combined with the multi-task learning framework, outperforms the other feature extraction methods in overall accuracy, with average precision and mIoU improving by 1.16% to 5.96% and 2.43% to 6.38%, respectively, while exceeding 90 FPS in frame rate. The overall mIoU values of the MD TANet model for feature extraction on the Vaihingen, WHDLD, and DLRSD datasets are 75.98%, 67.55%, and 72.20%, respectively, higher than the comparison methods for most categories of ground features. The MD TANet network structure proposed in this paper can meet the demands of daily high-resolution remote sensing video image processing tasks, with an emphasis on innovation and practicality.
This paper proposes a multi-decoder triple attention model based on multi-task learning and makes some progress on the task of recognizing multiple types of ground features in remote sensing images, but some shortcomings remain and deserve more in-depth follow-up research: (1) The study areas of this paper have relatively simple topographic environments and rich remote sensing resources; follow-up research can address areas with complex terrain and areas where cloud and rain make it difficult to obtain high-quality remote sensing data. (2) The feature extraction accuracy of this method still has some room for improvement.
