
Vehicle Target Detection in Rainy and Foggy Scenes Based on Generative Adversarial Networks and Dynamic Fuzzy Compensation Techniques

29 September 2025

Introduction

Weather is an important factor affecting traffic accidents, and adverse weather is a major cause of them [1]. In typical adverse conditions such as fog and rain, low visibility and insufficient sight distance prevent the driver from clearly identifying the direction of travel, and the handling and stability of the vehicle are degraded [2-3]. Compared with accidents on sunny days, traffic accidents caused by adverse weather have more serious consequences, with a casualty rate roughly 10 times that under normal weather conditions [4].

How to reduce the incidence of traffic accidents and improve the safety of road transportation is a global problem deserving common attention, and a great challenge for researchers and practitioners around the world [5-6]. To reduce traffic accidents and improve traffic efficiency, automobiles are gradually developing toward intelligence, lightweight design, network connectivity, electrification, and shared mobility [7]. With the development of intelligent transportation and smart cities, intelligent driving cars and intelligent driver assistance systems have attracted great attention [8]. As a complex system integrating perception, cognition, planning and control, the intelligent driving car benefits from the increasing maturity of machine vision, sensor technology, artificial intelligence, automatic control and other related technologies [9-11]. With good application prospects and a broad potential market, intelligent driving cars have received funding and research support from many countries and have achieved many technological breakthroughs [12-13].

Environment sensing technology for road traffic is a crucial link for intelligent vehicles to realize intelligence, and a basic guarantee for the safety and intelligence of intelligent transportation [14]. Environment sensing serves as the “eyes” of the intelligent vehicle, which perceives its surroundings while driving through visual sensors, the most common of which are cameras [15-16]. Compared with active ranging sensors, cameras provide dense pixel information about the surrounding scene at relatively low cost, improving the accuracy of perceiving object categories and shapes, and are widely used in image processing and target detection [17-18]. Vehicles are one of the main elements of the road traffic scene, and vehicle detection refers to accurately identifying and localizing vehicle targets in the acquired visual data [19]. Vehicle detection algorithms are important components of visual perception systems and are widely used in video surveillance, intelligent transportation and other fields [20]. Target detection is an important branch of image processing and computer vision, and one of the most important and indispensable tasks in automated driving: detecting targets quickly and accurately not only supports precise navigation but also helps handle potential hazards in complex driving environments. It is also a prerequisite for other processing in ITS, such as vehicle monitoring, vehicle type recognition, traffic flow statistics and license plate recognition [21-22].

In recent years, a large number of detection models based on deep convolutional neural networks have been introduced to improve target detection performance [23]. Although these detectors achieve good accuracy under favorable weather, adverse weather such as rain, snow, fog, sand and dust is common in real life. Images acquired under these conditions are degraded and blurred, making it difficult to extract effective features, so the detection accuracy of these detectors drops sharply. It is estimated that almost 30% of all traffic accidents are caused by rainy and foggy conditions, roughly 70% higher than under clear weather, because rain, snow and fog interfere with the driver’s vision and lead to misjudgment of road conditions [24-27]. Rainy and foggy weather has therefore become an important factor affecting road safety, and since it cannot be eliminated artificially, improving the accuracy of vehicle target detection in rain and fog is extremely important for intelligent transportation management [28].

A dynamic blurred image processing method based on the Wiener filter and a generative adversarial network is first proposed. The Wiener filter deblurring algorithm removes noise by minimizing the mean square error. A generative adversarial network (GAN) model, which does not rely on a predefined data distribution, is then considered, and a UNIT-based de-fogging and de-raining algorithm is proposed that introduces an encoder-decoder architecture to capture more useful information. The loss function is further modified to generate realistic and clear images. On this basis, an image de-raining assisted, locally perception-enhanced vehicle detection model is constructed, with Swin Transformer chosen as the design prototype of the backbone network; the whole network is divided into two parts, image de-raining/de-fogging and target detection. Finally, the performance of the proposed method is verified by experiments on both simulated data and real scenes.

Method
Image processing based on Wiener filter and generative adversarial network
Wiener filter deblurring algorithm

Filter-based deblurring estimates a desirable clear image $u$ from an observed blurred image $u_0$ by a filter estimator, with $w$ denoting the corresponding filter, as represented in equation (1): $\hat{u} = \hat{u}_w = w \times u_0$

For a pure denoising process on a noisy image unaffected by blurring, linear filtering is a natural tool for noise suppression by convolution; for deblurring, it can be viewed as an attempt to undo the effect of one convolution operation by another convolution operation [29]. For example, ignoring noise, the filter can be written in the Fourier frequency domain, with the Fourier transform of the kernel denoted $K(\omega)$ and the frequency variable denoted $\omega = (\omega_1, \omega_2)$: $W(\omega) = \dfrac{1}{K(\omega)}$

With this filter, $\hat{u} = w \times u_0 = u$ is recovered exactly for any clear image $u$. But a typical blurring kernel $K$ is a low-pass filter whose Fourier transform $K(\omega)$ decays rapidly in the high-frequency part, which makes the process unstable and can leave the recovered image seriously distorted.

In order to resolve this instability in the recovery process, the above equation is improved to: $W(\omega) = \dfrac{K^*(\omega)}{K(\omega)K^*(\omega)} = \dfrac{K^*}{|K|^2}$

where $^*$ denotes the complex conjugate, and the instability of the denominator in the high-frequency part is regularized by adding a positive factor $r = r(\omega)$: $W \rightarrow W_r = \dfrac{K^*}{|K|^2 + r}$

Denoting the resulting estimate by $\hat{u}_r$, we have $\hat{u}_r = w_r \times k \times u$

or in the Fourier frequency domain: $W_r K(\omega) = \dfrac{|K(\omega)|^2}{|K(\omega)|^2 + r(\omega)}$

In the low-frequency part, where $r \ll |K|^2$, the recovered image is approximately the expected one. Where $r \gg |K|^2$, in the high-frequency part, the result is severely distorted because $K$ nearly vanishes. The regularization factor $r$ thus acts as a threshold.

Since image noise affects deblurring, selecting an optimal regularization factor is particularly important; for this, Wiener’s minimum mean square error criterion is used in the original implementation.

Generating Adversarial Network Models (GAN)

A Generative Adversarial Network (GAN) is a network containing a generator and a discriminator. The generator produces samples that “look like real samples” from an input noise signal, and the discriminator is used to distinguish the generated samples from real ones [30]. Taking photo generation as an example (G is the generator and D is the discriminator):

G receives a random noise x and generates a photo from it, denoted G(x).

D determines whether a photo is “real”. Its input is y, a photo, and its output D(y) is the probability that y is a real photo: an output of 1 means the photo is real, and an output of 0 means it is not.

Throughout training, the generator tries to generate realistic photos to deceive the discriminator, while the discriminator tries to distinguish real photos from generated ones, forming a dynamic “game process”.

As a recent form of machine learning, generative adversarial networks offer, compared with general neural networks, high-resolution and sharp output images and generality in the choice of generator and discriminator. Compared with other generative models, GANs no longer require a predefined data distribution and have maximum freedom of fit.
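As an illustration of the adversarial game described above, the following is a minimal PyTorch sketch (not the networks used in this paper); the layer sizes, the 100-dimensional noise input and the flattened 784-pixel images are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the adversarial game: G maps noise x to a fake sample G(x),
# D outputs the probability that its input is real. Sizes are illustrative only.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real):                      # real: batch of flattened real images
    b = real.size(0)
    fake = G(torch.randn(b, 100))

    # Discriminator tries to assign 1 to real samples and 0 to generated ones.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    loss_d.backward()
    opt_d.step()

    # Generator tries to make D label its samples as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(b, 1))
    loss_g.backward()
    opt_g.step()
```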

Design of the Wiener filtering algorithm

Wiener filtering assumes that the deblurred estimate $\hat{u}$ is obtained by filtering the blurred image $u_0$ with an optimal filter $w(X)$: $\hat{u}_w = w \times u_0$

The Wiener filter $w(X)$ is the filter that minimizes the mean square of the estimation error $e_w(X) = \hat{u}_w(X) - u(X)$, i.e. the optimal filter:

$w = \arg\min_h E\big[e_h^2\big] = \arg\min_h E\big[(h \times u_0(X) - u(X))^2\big]$

which yields the orthogonality condition: $E\big[(w \times u_0(X) - u(X))\,u_0(Y)\big] = 0, \quad \forall X, Y \in \Omega$

which can be rewritten in terms of correlation functions: $w \times R_{u_0 u_0}(Z) = R_{u u_0}(Z), \quad Z \in \mathbb{R}^2$

This leads to the explicit form of the optimal Wiener filter in the frequency domain: $W(\omega) = \dfrac{S_{u u_0}(\omega)}{S_{u_0 u_0}(\omega)}$, where $S$ denotes the power spectrum (the Fourier transform of the corresponding correlation function).

For the blurred image $u_0 = k \times u + n$, $S_{u u_0} = K^*(\omega) S_{uu}$ and $S_{u_0 u_0} = |K|^2 S_{uu} + S_{nn}$, so that: $W(\omega) = \dfrac{K^*(\omega) S_{uu}}{|K|^2 S_{uu} + S_{nn}} = \dfrac{K^*}{|K|^2 + r}$

where the regularization factor $r = S_{nn}/S_{uu}$ is the noise-to-signal power ratio (the square of the noise-to-signal amplitude ratio).
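The frequency-domain form of this regularized Wiener filter can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the blur kernel k and a scalar estimate of the noise-to-signal ratio r are known; it is not the full implementation used in this paper.

```python
import numpy as np

def wiener_deblur(u0, k, nsr):
    """Frequency-domain Wiener deconvolution (sketch).

    u0  : observed blurred (and noisy) grayscale image, 2-D array
    k   : blur kernel (PSF), assumed known, smaller than u0
    nsr : scalar estimate of the noise-to-signal power ratio r = S_nn / S_uu
    """
    K = np.fft.fft2(k, s=u0.shape)            # kernel zero-padded to the image size
    U0 = np.fft.fft2(u0)
    Wr = np.conj(K) / (np.abs(K) ** 2 + nsr)  # regularized filter W_r = K* / (|K|^2 + r)
    return np.real(np.fft.ifft2(Wr * U0))     # estimated clear image (up to circular shift)
```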

UNIT-based de-fogging and de-rain algorithm
UNIT framework

The UNIT structure is shown schematically in Fig. 1. The UNIT network can be viewed formally as a combination of two VAE/GAN models. The network consists of three main parts: the encoders E1 and E2, the generative models G1 and G2, and the discriminative models D1 and D2, with x1 and x2 representing images in the source and target domains, respectively. For the unsupervised image style conversion task, samples can only be drawn from the respective marginal distributions. However, infinitely many joint distributions are consistent with given marginal distributions, so the joint distribution cannot be recovered from the marginals without further assumptions. Based on the theory of a “shared latent space”, it is therefore assumed that any inputs x1 and x2 have a common latent code z in a shared latent space, from which the image of the corresponding domain can be generated or the original image restored.

Figure 1.

Structure of the UNIT

As shown in (a), there are two picture domains with different styles in UNIT, the source domain X1 and the target domain X2. In the UNIT network model, encoders E1 and E2 map the input pictures x1 and x2 from the different domains to a shared latent space and encode them into a common latent code z, with z = E1(x1) = E2(x2). The generative models G1 and G2 convert the latent code z into pictures of the corresponding domains. For generative model G1, whose input is the latent code z, the latent code obtained from domain X1 can be mapped back into the source domain X1, and the latent code obtained from domain X2 can also be mapped into the source domain X1. Like the discriminative models in the original GAN, D1 and D2 judge the authenticity of images, i.e., whether an input image is a real sample or a sample generated by the generative model.

As shown in (b), picture x1 in domain X1 is encoded by E1 into the latent code z, which can be mapped back to domain X1 by generative model G1 to obtain the self-reconstructed image $x_1^{1\to1}$, or passed through generative model G2 to obtain the domain-translated image $x_1^{1\to2}$. Similarly, picture x2 in domain X2 is encoded by E2 into the latent code z, from which generative model G1 produces the domain-translated image $x_2^{2\to1}$ and generative model G2 produces the self-reconstructed image $x_2^{2\to2}$.

UNIT-based de-fogging and de-rain algorithm

Network Structure

In this section, the proposed VAE-CoGAN de-fogging and de-raining model handles images of foggy and rainy scenes without pre-classifying them. VAE-CoGAN consists of three parts: an encoder, a generative model, and a discriminative model.

The encoder converts the input pictures into vector form to be used as input to the generative models of the GAN. For a blurred picture x1 with fog or rain streaks in the source domain and a clear picture x2 in the target domain, the same latent code z is obtained by mapping the pictures (x1, x2) to the shared latent space through encoders E1 and E2, with z = E1(x1) = E2(x2).

The generative model converts the latent code z produced by the encoder back into a picture. Generative model $G_1$ yields the generated pictures $x_1^{1\to1}$ and $x_2^{2\to1}$, where $x_2^{2\to1}$ is a blurred picture converted from a clear picture in the target domain $X_2$, while $x_1^{1\to1}$ is a self-reconstructed picture without style conversion, which does not help the image style conversion task studied in this chapter. Similarly, generative model $G_2$, whose input is the latent code z, produces the generated images $x_1^{1\to2}$ and $x_2^{2\to2}$, where $x_1^{1\to2}$ is a clear picture transformed from a blurred image in the source domain $X_1$, i.e. the clear picture corresponding to the foggy or rainy picture that this chapter aims to obtain, and $x_2^{2\to2}$ is a self-reconstructed picture without style conversion, which likewise does not help the style conversion task.
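The four translation paths described above (two self-reconstructions and two cross-domain translations through the shared latent code) can be sketched schematically in PyTorch as follows; the encoder and generator architectures shown are simple placeholders, not the actual VAE-CoGAN networks.

```python
import torch
import torch.nn as nn

# Placeholder encoder/generator pair; the real architectures are deeper.
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
    def forward(self, x):
        return self.net(x)                  # latent code z

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())
    def forward(self, z):
        return self.net(z)

E1, E2, G1, G2 = Encoder(), Encoder(), Generator(), Generator()

x1 = torch.randn(1, 3, 256, 256)            # rainy/foggy image (source domain X1)
x2 = torch.randn(1, 3, 256, 256)            # clear image (target domain X2)

z1, z2 = E1(x1), E2(x2)                     # codes in the shared latent space
x1_to_1 = G1(z1)                            # self-reconstruction of x1
x1_to_2 = G2(z1)                            # de-rained/de-fogged version of x1
x2_to_1 = G1(z2)                            # degraded version of x2
x2_to_2 = G2(z2)                            # self-reconstruction of x2
```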

Loss function

The objective function of this network consists of four components: the VAE loss, the GAN loss, the cycle-consistency loss, and the VGG perceptual loss. The objective function is as follows: $$\mathcal{L}(E_1,E_2,G_1,G_2,D_1,D_2) = \mathcal{L}_{VAE_1}(E_1,G_1) + \mathcal{L}_{VAE_2}(E_2,G_2) + \mathcal{L}_{GAN_1}(E_2,G_1,D_1) + \mathcal{L}_{GAN_2}(E_1,G_2,D_2) + \mathcal{L}_{CC_1}(E_1,G_1,E_2,G_2) + \mathcal{L}_{CC_2}(E_2,G_2,E_1,G_1) + \mathcal{L}_{content}(E_1,E_2,G_1,G_2)$$

The VAE loss aims to minimize the objective function and consists of two components, a regularization term and a reconstruction error term: $\mathcal{L}_{VAE} = \mathcal{L}_{prior} + \mathcal{L}_{like}^{pixel}$,

where $\mathcal{L}_{prior} = KL\big(q(z|x)\,\|\,p_\eta(z)\big)$ and $\mathcal{L}_{like}^{pixel} = -\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]$.

Regularization provides a simple way to sample from the latent space, and minimizing the negative log-likelihood term in the reconstruction error is equivalent to minimizing the absolute distance between the image and the reconstructed image. This chapter uses two variational autoencoders with: $$\mathcal{L}_{VAE_1}(E_1,G_1) = \lambda_4 KL\big(q_1(z_1|x_1)\,\|\,p_\eta(z)\big) - \lambda_5 \mathbb{E}_{z_1 \sim q_1(z_1|x_1)}\big[\log p_{G_1}(x_1|z_1)\big]$$ $$\mathcal{L}_{VAE_2}(E_2,G_2) = \lambda_4 KL\big(q_2(z_2|x_2)\,\|\,p_\eta(z)\big) - \lambda_5 \mathbb{E}_{z_2 \sim q_2(z_2|x_2)}\big[\log p_{G_2}(x_2|z_2)\big]$$

where λ4 and λ5 are hyperparameters controlling the weights of the two terms. The KL divergence term measures the distance between $q(z|x)$ and $p_\eta(z)$; the smaller the KL value, the smaller the distance.

In this model, the GAN loss function is used to ensure that the generated images are as similar as possible to images in the target domain: $$\mathcal{L}_{GAN_1}(E_2,G_1,D_1) = \lambda_0 \mathbb{E}_{x_1 \sim p_{x_1}}\big[\log D_1(x_1)\big] + \lambda_0 \mathbb{E}_{z_2 \sim q_2(z_2|x_2)}\big[\log\big(1 - D_1(G_1(z_2))\big)\big]$$ $$\mathcal{L}_{GAN_2}(E_1,G_2,D_2) = \lambda_0 \mathbb{E}_{x_2 \sim p_{x_2}}\big[\log D_2(x_2)\big] + \lambda_0 \mathbb{E}_{z_1 \sim q_1(z_1|x_1)}\big[\log\big(1 - D_2(G_2(z_1))\big)\big]$$

Consider the style-translated pictures produced by generative models $G_1$ and $G_2$, namely $x_2^{2\to1} = F^{2\to1}(x_2) = G_1(E_2(x_2))$ and $x_1^{1\to2} = F^{1\to2}(x_1) = G_2(E_1(x_1))$. To make the results more realistic, the idea of cycle consistency is incorporated: $x_1 = F^{2\to1}(F^{1\to2}(x_1))$ and $x_2 = F^{1\to2}(F^{2\to1}(x_2))$. The cycle-consistency losses are: $$\mathcal{L}_{CC_1}(E_1,G_1,E_2,G_2) = \lambda_6 KL\big(q_1(z_1|x_1)\,\|\,p_\eta(z)\big) + \lambda_6 KL\big(q_2(z_2|x_1^{1\to2})\,\|\,p_\eta(z)\big) - \lambda_7 \mathbb{E}_{z_2 \sim q_2(z_2|x_1^{1\to2})}\big[\log p_{G_1}(x_1|z_2)\big]$$ $$\mathcal{L}_{CC_2}(E_2,G_2,E_1,G_1) = \lambda_6 KL\big(q_2(z_2|x_2)\,\|\,p_\eta(z)\big) + \lambda_6 KL\big(q_1(z_1|x_2^{2\to1})\,\|\,p_\eta(z)\big) - \lambda_7 \mathbb{E}_{z_1 \sim q_1(z_1|x_2^{2\to1})}\big[\log p_{G_2}(x_2|z_1)\big]$$

The formula for the VGG loss is: $$\mathcal{L}_{content}(E_1,E_2,G_1,G_2) = \mathbb{E}_{x_1 \sim p_{x_1}}\big[\|\varphi_i(G_1(E_2(G_2(E_1(x_1))))) - \varphi_i(x_1)\|_1\big] + \mathbb{E}_{x_2 \sim p_{x_2}}\big[\|\varphi_i(G_2(E_1(G_1(E_2(x_2))))) - \varphi_i(x_2)\|_1\big]$$

where φi is the activation of layer i of the CNN.

The final objective of the network is: $$G^*, F^* = \arg\min_{G,F}\ \max_{D_{X_1},D_{X_2},D_{Y_1},D_{Y_2}} \mathcal{L}(G,F,D_{X_1},D_{X_2},D_{Y_1},D_{Y_2})$$

Here λ0 = 10, λ4 = 0.1, λ5 = 100, λ6 = 0.1, λ7 = 100.
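A schematic sketch of how these weighted loss terms could be assembled is given below. The helper functions are simplified stand-ins (e.g. a unit-variance posterior for the KL term and L1 reconstruction, following the absolute-distance interpretation above); they are not the exact implementation used in this paper.

```python
import torch
import torch.nn.functional as F

# Weights as specified above; each helper stands in for one loss component.
l0, l4, l5, l6, l7 = 10.0, 0.1, 100.0, 0.1, 100.0

def kl_to_standard_normal(mu):
    # KL(N(mu, I) || N(0, I)) for a unit-variance posterior (simplifying assumption).
    return 0.5 * torch.mean(mu ** 2)

def vae_loss(mu, x, x_recon):
    # Prior term plus L1 reconstruction (negative log-likelihood ~ absolute distance).
    return l4 * kl_to_standard_normal(mu) + l5 * F.l1_loss(x_recon, x)

def gan_loss_d(d_real, d_fake):
    # Discriminator side of the adversarial loss, weighted by lambda_0.
    return l0 * (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
                 F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

def cycle_loss(mu_src, mu_trans, x, x_cycled):
    # Two KL terms on the source and translated codes plus cycle reconstruction.
    return (l6 * kl_to_standard_normal(mu_src) +
            l6 * kl_to_standard_normal(mu_trans) +
            l7 * F.l1_loss(x_cycled, x))

def perceptual_loss(phi_x, phi_x_cycled):
    # phi_* are feature maps from a fixed VGG layer i.
    return F.l1_loss(phi_x_cycled, phi_x)
```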

Image De-Rain Assisted Local Perception Enhanced Vehicle Detection Model
Detection network structure

This section focuses on the network structure of the vehicle detection part and proposes a detection framework with local perceptual enhancement, in contrast to the most commonly used CNN-based detection algorithms. Unlike the detection backbone used for foggy images, the model feeds the features extracted by the de-raining part into the locally perception-enhanced Transformer backbone; the vehicle detection network is shown in Fig. 2. As in Swin Transformer, the four stages contain 2, 2, 6 and 2 blocks, respectively.

Figure 2.

Vehicle detection network

First, given a rainy image of size H × W × 3, the input is partitioned into a set of non-overlapping image patches by patch partitioning, where each patch has size 4 × 4, feature dimension 4 × 4 × 3, and the number of patches is $\frac{H}{4} \times \frac{W}{4}$. Each patch is then treated as a vector “token” through a linear embedding, with its features viewed as a concatenation of the original pixel RGB values. After the feature dimension of each patch token is changed to C, the tokens are fed into multiple locally aware Swin Transformer blocks for encoding, with different numbers of blocks at different stages. Neighboring image patches are then merged 2 × 2 by patch merging, which changes the number of patches to $\frac{H}{8} \times \frac{W}{8}$ and the feature dimension to 4C. This step is repeated at each stage until the number of patches becomes $\frac{H}{32} \times \frac{W}{32}$ and the feature dimension becomes 8C. Finally the features are fed to the regression head for target classification and localization regression. Each stage consists of a patch merging block or a linear embedding block followed by stacked local perceptual enhancement Transformer blocks.
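The 4 × 4 patch partition and linear embedding step can be sketched with a strided convolution, which is a common way to implement it; the embedding dimension C = 96 and the input size are illustrative assumptions, not values stated by the paper.

```python
import torch
import torch.nn as nn

# Patch partition + linear embedding: each non-overlapping 4x4 patch
# (4*4*3 = 48 raw values) is projected to a C-dimensional token.
C = 96                                              # embedding dimension (illustrative)
patch_embed = nn.Conv2d(3, C, kernel_size=4, stride=4)

x = torch.randn(1, 3, 640, 640)                     # rainy input image
tokens = patch_embed(x)                             # (1, C, H/4, W/4)
tokens = tokens.flatten(2).transpose(1, 2)          # (1, H/4 * W/4, C) token sequence
print(tokens.shape)                                 # torch.Size([1, 25600, 96])
```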

Local Perception Enhancement Transformer Block (LPST)

The positional encoding in the Transformer easily fails to capture local correlation and structural information in the image; Swin Transformer uses a window-based hierarchical structure to address the scaling problem and high computational complexity of high-resolution images. Each Swin Transformer block consists of a normalization layer, a multi-head self-attention module, residual connections, and a multilayer perceptron (MLP) with two fully connected layers and GELU nonlinearity. The window-based multi-head self-attention (W-MSA) module and the shifted-window-based multi-head self-attention (SW-MSA) module are applied in two consecutive Transformer blocks, respectively.

Although Swin Transformer constructs a hierarchical Transformer and performs attention within each non-overlapping window, it is limited in its ability to encode contextual information. To enhance the network’s learning of local relevance and structural information, this paper proposes a locally perception-enhanced Transformer, in which each block is composed of two consecutive improved Transformer blocks.

The dilation of a standard convolution kernel can be viewed as spacing out the values of the kernel during data processing. Dilated convolution introduces a hyperparameter, the dilation rate r, to the convolutional layer; ordinary convolution has r = 1, i.e. dilated convolution with r = 1 is standard convolution. The dilation process can be seen as inserting zeros between horizontally and vertically adjacent kernel weights: there are (k − 1) gaps, and r − 1 zeros are inserted in each gap. The inserted zeros do not participate in the convolution operation. The size of the dilated convolution kernel is therefore given by equation (23): $k_d = k + (k-1) \times (r-1)$

where kd represents the size of the dilation convolution kernel, k represents the size of the standard convolution kernel, and r represents the dilation rate. In particular, when r is set to 1, there is no null parameter, i.e., it is a standard convolution. Usually, the dilation convolution which has the same parameters as the standard convolution has more weight parameters and larger size to capture the global information in the image through the enlarged sensory field and discretized weight parameters [31].

Further, after the introduction of dilated convolution, the size of the corresponding output feature map is calculated as in equation (25): $$o = \left\lfloor \frac{i + 2p - k - (k-1)(r-1)}{s} \right\rfloor + 1$$

where o is the output feature map size, i is the size of the input feature map, p is the boundary padding of the input feature map, and s is the stride. The formula shows that if the kernel is dilated while the stride, boundary padding and other parameters remain unchanged, then for the same input feature map the output feature map after convolution becomes correspondingly smaller.
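A small numerical check of equations (23) and (25) in PyTorch, using illustrative values for k, r, i, p and s:

```python
import torch
import torch.nn as nn

k, r = 3, 2                                        # kernel size and dilation rate (illustrative)
k_d = k + (k - 1) * (r - 1)                        # effective kernel size: 3 + 2*1 = 5

i, p, s = 32, 0, 1                                 # input size, padding, stride
o = (i + 2 * p - k - (k - 1) * (r - 1)) // s + 1   # expected output size: 28

conv = nn.Conv2d(1, 1, kernel_size=k, dilation=r, padding=p, stride=s)
y = conv(torch.randn(1, 1, i, i))
print(k_d, o, y.shape)                             # 5  28  torch.Size([1, 1, 28, 28])
```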

Results and discussion
Experiments on robust rain removal methods for multiple rain degradation types
Experimental setup details

Comparison methods

The algorithm in this paper is compared with three different classes of rain removal methods: (1) one raindrop removal method, AttentGAN; (2) six rain streak removal methods, DetailNet, RESCAN, PReNet, JORDER-E, RCDNet and RLNet; (3) two robust rain removal methods, Pix2pix and CCN. In addition to the algorithm in this paper, the RadNet algorithm is also tested.

Datasets

Four datasets are selected for training and testing: Rain200H, Rain200L, RainDrop and RainDS. RainDS contains three synthetic subsets (RS_syn, RD_syn and RDS_syn) and three real subsets (RS_real, RD_real and RDS_real). Based on these datasets, three data strategies are designed to test the robustness of the rain removal methods: (1) Single-type data strategy, i.e., a single dataset in which each image contains only one type of rain degradation; the eligible datasets include the single rain streak datasets (Rain200H, Rain200L, RS_syn and RS_real) and the single raindrop datasets (RainDrop, RD_syn and RD_real). (2) Stacked data strategy, i.e., a single dataset in which each image contains two rain degradation types; the eligible datasets are RDS_syn and RDS_real. (3) Hybrid data strategy, i.e., multiple datasets mixed together, where each image may contain a single degradation type or multiple types; three hybrid datasets are constructed: Blended-1 = {RD_syn + RS_syn + RDS_syn}, Blended-2 = {RD_real + RS_real + RDS_real} and Blended-3 = {Rain200H + Rain200L + RainDrop}. In addition, real scene data were collected from the Internet and previous work to construct real benchmarks for examining the de-raining ability of each method in real scenes. Peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) are used to evaluate paired data. For unlabeled data, comparisons are made based on visualization results.
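For paired data, PSNR and SSIM could be computed per image pair as in the following sketch using scikit-image; this is an assumption about tooling, as the paper does not specify its evaluation code.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored, ground_truth):
    """PSNR/SSIM for one restored image against its ground truth (uint8 RGB arrays)."""
    psnr = peak_signal_noise_ratio(ground_truth, restored, data_range=255)
    # channel_axis requires scikit-image >= 0.19 (older versions use multichannel=True).
    ssim = structural_similarity(ground_truth, restored, data_range=255, channel_axis=-1)
    return psnr, ssim
```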

Training details

The algorithms in this paper were trained on two NVIDIA GeForce RTX 3080 GPUs with 24 GB of memory, using the PyTorch deep learning framework in a Python environment. Adam was chosen as the optimizer, with weight decay and momentum set to 0.0001 and 0.9, respectively. The initial learning rate is set to 1e-3 for the RAM and DRM and 1e-6 for the FWM, and the learning rate is multiplied by 0.2 every 30 epochs. Each image is randomly cropped to 128 × 128 pixels. The network is trained for 100 epochs to convergence with a batch size of 16.
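A sketch of this optimizer and schedule in PyTorch is shown below; the RAM/DRM and FWM modules are placeholders (the real architectures are not specified here), and the paper's "momentum 0.9" is interpreted as Adam's beta1 = 0.9.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the RAM/DRM and FWM parts of the network.
ram_drm = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
fwm = nn.Sequential(nn.Conv2d(16, 3, 3, padding=1))

# Adam with weight decay 1e-4; different initial learning rates per parameter group.
optimizer = torch.optim.Adam(
    [{"params": ram_drm.parameters(), "lr": 1e-3},   # RAM and DRM branches
     {"params": fwm.parameters(), "lr": 1e-6}],      # FWM branch
    betas=(0.9, 0.999), weight_decay=1e-4)

# Learning rate multiplied by 0.2 every 30 epochs, over 100 training epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.2)
```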

Simulation experiment results and analysis

Experimental results and analysis on single-type data

First, the performance of all methods is tested under the single-type strategy; the quantitative evaluation results on the single-type dataset (rain streak) are shown in Table 1 and on the single-type dataset (raindrop) in Table 2. It can be seen that: (1) the performance of this paper’s algorithm is much better than the other methods on the two real datasets, RS_real and RD_real; (2) compared with CCN, which is also a robust method, this paper’s algorithm performs better on all datasets except RainDrop; specifically, the PSNR of this paper’s algorithm on RS_syn is 5 dB higher than that of CCN; (3) the poorer performance on the RainDrop dataset relative to CCN may be attributed to the fact that CCN has two independent modules that process rain streaks and raindrops separately, so it can deal with RainDrop data containing large blurred regions; (4) in terms of average score, the algorithm in this paper is highly competitive.

Quantitative evaluation results on single-type dataset (rain streak)

Method Rain200H Rain200L RS_syn RS_real Average (PSNR/SSIM)
AttentGAN 22.98/0.73 28.29/0.885 27.53/0.658 24.42/0.425 25.81/0.675
DetailNet 26.34/0.835 34.41/0.869 30.86/0.686 26.18/0.84 29.45/0.808
RESCAN 26.71/0.114 37.02/0.723 38.64/0.982 26.36/0.938 32.18/0.689
PReNet 28.17/0.869 36.82/0.951 39.37/0.969 25.93/0.998 32.57/0.947
JORDER-E 29.51/0.741 39.29/0.975 40.16/0.737 26.31/0.603 33.82/0.764
RCDNet 30.77/0.775 39.79/0.286 44.16/0.833 27.24/0.991 35.49/0.721
RLNet 29.47/0.787 38.36/0.992 37.06/0.966 26.85/0.755 32.94/0.875
Pix2pix 24.1/0.496 29.78/0.943 28.06/0.815 24.9/0.937 26.71/0.798
CCN 28.98/0.722 37.86/0.864 35.1/0.904 26.81/0.901 32.19/0.848
RadNet 30.38/0.995 38.74/0.985 39.17/0.891 26.67/0.773 33.74/0.911
Ours 30.23/0.985 38.56/0.855 39.57/0.201 27.69/0.929 34.01/0.743

Quantitative assessment results on single-type data sets (raindrops)

Method RainDrop RD_syn RD_real Average (PSNR/SSIM)
AttentGAN 30.6/0.932 27.26/0.87 21.75/0.669 26.54/0.824
DetailNet 25.02/0.594 28.42/1.262 22.13/0.821 25.19/0.892
RESCAN 25.54/0.953 34.45/0.885 23.03/0.584 27.67/0.807
PReNet 25.6/0.526 34.92/0.8 23.66/0.465 28.06/0.597
JORDER-E 26.62/0.989 35.55/0.832 23.83/0.918 28.67/0.913
RCDNet 26.28/0.887 35.18/0.991 24.36/0.837 28.61/0.905
RLNet 26.6/0.518 33.28/0.921 23.85/0.726 27.91/0.722
Pix2pix 25.55/0.935 25.07/0.493 20.46/0.779 23.69/0.736
CCN 31.49/0.871 33.45/0.815 24.63/0.888 29.86/0.858
RadNet 24.66/0.945 35.4/0.922 23.69/0.832 27.92/0.900
Ours 24.09/0.475 35.61/0.582 28.25/0.943 29.32/0.667

Experimental results and analysis of superimposed type data

The de-raining task under this data strategy is harder than under the single-type strategy, mainly because the network must handle rain streaks and raindrops simultaneously. The quantitative evaluation results on the superimposed-type dataset are shown in Table 3. From the PSNR/SSIM results in the table it can be seen that: (1) the algorithm in this paper obtains the best performance on both the synthetic dataset (RDS_syn) and the real dataset (RDS_real); compared with CCN, it achieves a 2 dB PSNR improvement on RDS_syn and a 4 dB PSNR improvement on RDS_real; (2) all methods perform poorly on RDS_real, mainly because its image pairs do not correspond at the pixel level, which is difficult for all supervised methods; thanks to the effectiveness of the proposed FWM, the algorithm in this paper can still achieve excellent results.

Quantitative evaluation results on superimposed-type dataset

Method RDS_syn RDS_real Average
AttentGAN 24.91/0.816 21.03/0.999 22.97/0.908
DetailNet 26.56/0.814 22.43/0.19 24.50/0.502
RESCAN 31.65/0.898 21.66/0.446 26.66/0.672
PReNet 32.79/0.782 22.8/0.893 27.80/0.838
JORDER-E 33.3/0.372 23.09/0.352 28.20/0.362
RCDNet 34.18/0.744 23.33/0.817 28.76/0.781
RLNet 32.29/0.861 23.73/0.581 28.01/0.721
Pix2pix 23.78/0.658 20.16/0.652 21.97/0.655
CCN 32.15/0.98 22.81/0.805 27.48/0.893
RadNet 34.25/0.981 23.55/0.908 28.90/0.945
Ours 34.07/0.801 27.06/0.807 30.57/0.804

Experimental results and analysis of hybrid data

The de-raining task under this data strategy is more difficult than under the previous two, since the network must not only handle the two degradation phenomena of rain streaks and raindrops simultaneously but also cope with fitting problems caused by the different distributions of the datasets. The quantitative assessment results on the hybrid dataset are shown in Table 4. From the PSNR/SSIM results it can be seen that this paper’s algorithm achieves the best performance, exceeding RCDNet by 1 dB PSNR on the Blended-1 dataset, while on the Blended-2 dataset it outperforms RCDNet by close to 3 dB PSNR and 0.011 SSIM. The results also show that the algorithm in this paper is 1 dB PSNR higher than CCN*.

Quantitative evaluation results on blended-type dataset

Method Blended-1 Blended-2 Blended-3 Average
AttentGAN 26.09/0.414 22.64/0.446 23.91/0.916 24.21/0.592
DetailNet 27.15/0.893 23.52/0.905 23.78/0.803 24.82/0.867
RESCAN 33.16/0.776 23.49/0.897 28.56/0.987 28.40/0.887
PReNet 34.19/0.929 23.96/0.982 29.88/0.791 29.34/0.901
JORDER-E 34.99/0.828 24.2/0.812 28.23/0.822 29.14/0.821
RCDNet 35.54/0.725 24.49/0.902 29.49/0.761 29.84/0.796
RLNet 35.54/0.873 25.31/0.908 30.61/0.924 30.49/0.902
Pix2pix 24.59/0.73 22.49/0.581 24.53/0.68 23.87/0.664
RadNet 36.65/0.882 24.27/0.808 30.74/0.708 30.55/0.799
Ours 36.82/0.582 30.02/0.947 30.14/0.888 32.33/0.806
CCN* 33.6/0.735 24.58/0.561 32.91/0.592 30.36/0.629
RadNet* 36.37/0.931 24.75/0.872 31.34/0.948 30.82/0.917
Ours* 36.39/0.941 27.49/0.913 31.21/0.958 31.70/0.937

The performance of the different methods under the three data strategies is shown in Fig. 3 (panel (a) shows the PSNR results and panel (b) the SSIM results). It can be seen that: (1) the algorithm in this paper is only slightly weaker than RCDNet under the single-type data strategy, while it obtains the best performance in all other cases; (2) the improvement of this paper’s algorithm is very significant under the stacked and hybrid data strategies, which mainly test the robustness of each method, so the robustness of the algorithm is fully verified; (3) the algorithm in this paper is far better than the other robust rain removal method, CCN.

Figure 3.

Performance of the different methods under the three data strategies

Experiments on target detection performance of this paper’s model in rainy and foggy scenes
Model training

Dataset library construction

When performing the target detection task under rain and fog conditions, obtaining a sufficient number of rain and fog samples is crucial. However, there are relatively few public traffic datasets for rain and fog, which limits the performance of deep learning models under such conditions. To address this problem, this paper generates fog-containing samples of different densities from the BDD100K dataset based on the atmospheric scattering model, so that the dataset contains foggy images ranging from light to dense fog and covers a variety of fog conditions. However, samples generated by the atmospheric scattering model alone are somewhat uniform and limited. To further enrich the dataset, a generative adversarial network is employed to generate more realistic and diverse fog-containing images, effectively expanding the size and diversity of the dataset. At the same time, real rainy-day images are manually collected and labeled to enrich the training data, and a rain and fog dataset integrating real images, the physical model and the generative adversarial network is constructed. The rain and fog images generated in this way have different density levels, which improves the breadth and richness of the dataset and provides more abundant data resources for the subsequent convolutional neural network based target detection model.
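Fog synthesis with the atmospheric scattering model, I(x) = J(x)t(x) + A(1 − t(x)) with t(x) = exp(−β d(x)), can be sketched as follows; the function name and parameter values are illustrative, and a per-pixel depth map d(x) is assumed to be available.

```python
import numpy as np

def synthesize_fog(image, depth, beta=1.0, A=0.9):
    """Add synthetic fog via the atmospheric scattering model (sketch).

    image : clear image J(x), float array in [0, 1], shape (H, W, 3)
    depth : per-pixel scene depth d(x) in relative units, shape (H, W)
    beta  : scattering coefficient; larger values give denser fog
    A     : global atmospheric light
    """
    t = np.exp(-beta * depth)[..., None]   # transmission map t(x) = exp(-beta * d(x))
    return image * t + A * (1.0 - t)       # I(x) = J(x) t(x) + A (1 - t(x))

# Varying beta (e.g. 0.5 for light fog, 2.0 for dense fog) produces the range of
# fog densities described above.
```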

Model parameter setting

The model in this paper is implemented with the PyTorch framework and trained on an NVIDIA RTX 3090 GPU. During training, the original images of size 640 × 360 are resized to 640 × 640 at input; the original size is kept during testing. The initial learning rate is 0.01 and the optimizer is stochastic gradient descent (SGD). The momentum and weight decay are set to 0.937 and 0.0005 respectively, the batch size is 8, and the number of training epochs is 200. Mosaic and left-right image flipping are used as data augmentation strategies to enrich the dataset and enhance the generalization of the model.

Experimental results of data derivation and receptive field amplification for rain and fog scenes

Experimental results and analysis of the effectiveness of data derivation for rain and fog scenes

To verify the effectiveness of the data augmentation method, the HazeSim dataset, the GANHaze dataset and the AtmoGAN Haze dataset are input into the network model of this paper respectively, using mAP@0.5 as the evaluation index; the data augmentation results are shown in Table 5. As can be seen from the table, the AtmoGAN Haze dataset achieves 75.6%, 54.8% and 67.1% in precision (P), recall (R) and mean average precision (mAP), respectively. After data expansion, the AtmoGAN Haze dataset improves the detection result by 4.3% relative to the un-expanded HazeSim dataset and by 1.1% relative to the GANHaze dataset, showing that the expanded AtmoGAN Haze dataset performs better than the pre-expansion datasets.

Data augmentation results

Data set P R mAP
AtmoGAN Haze 75.6% 54.8% 67.1%
HazeSim 72.8% 49.8% 62.8%
GANHaze 67.3% 45.5% 66%
Fog Traffic 85.5% 58.9% 72.7%

The detection accuracy for each category is shown in Table 6. The per-category results show that the AtmoGAN Haze dataset achieves relatively high mAP for pedestrians, riders, cars, buses, trucks, bicycles and motorcycles. The mAP for pedestrian detection reaches 48.1%, the highest result compared with the unexpanded datasets, proving that the expanded dataset enables better pedestrian detection. For riders, cars, buses and trucks, AtmoGAN Haze also achieves 71.1%, 80.9%, 66.8% and 60.4%, respectively, maintaining a leading position compared with the unexpanded datasets. These results demonstrate that the fog-containing dataset augmented with different methods enhances the generalization of the model, and they also highlight the superiority and usefulness of the AtmoGAN Haze dataset in image processing tasks, providing an important reference for further research and applications.

Detection accuracy for each category on different datasets

Data set Pedestrian Rider Car Bus Truck Bicycle Motorcycle
AtmoGAN Haze 48.1% 71.1% 80.9% 66.8% 60.4% 64.2% 73.7%
HazeSim 44.8% 67.3% 68.2% 71.9% 57.7% 60.2% 66.7%
GANHaze 57.6% 60.2% 86.3% 54.3% 51% 27.2% 57%
Fog Traffic 70.6% 70.4% 81.4% 84.6% 77.2% 65.2% 62.7%

To ensure that the model also performs well under rainy conditions, 500 of the 750 images obtained from the Rain dataset were added to the training set of the AtmoGAN Haze dataset and the remaining 250 images were added to the test set, forming the rain and fog dataset HydroFogRain used in this study. This dataset was constructed to ensure that the model has robust target detection performance under rainy and foggy conditions. Based on the HydroFogRain dataset, the model of this paper with receptive field amplification is compared with the current mainstream model YOLOV5; the comparison results are shown in Table 7. It can be seen that the precision of this paper’s model reaches 77.2%, higher than that of YOLOV5, indicating that this paper’s model is more accurate when predicting positive samples. The recall reaches 53.8%, also higher than YOLOV5, indicating that this paper’s model better captures the true positives among the positive samples. The mAP of this paper’s model reaches 66.2%, significantly higher than YOLOV5, indicating better overall performance.

YOLOV5 comparison results

Model P R mAP Parameters GFLOPs
YOLOV5 76.9% 53% 56.5% 45.3M 109.4
Ours 77.2% 53.8% 66.2% 54.4M 302.1

The detection accuracy for each category is shown in Table 8. As can be seen from the table, for various different traffic scenarios, this paper’s model achieves relatively high detection accuracies for pedestrians, riders, cars, buses, trucks, bicycles and motorcycles. First, the detection accuracy of this paper’s model is significantly better than that of YOLOV5 for pedestrians and riders, reaching 53.1% and 73.5%, respectively, compared with only 44.5% and 59.1% for YOLOV5, indicating that this paper’s model is more capable of accurately detecting pedestrians and riders in complex situations. Second, this paper’s model performs equally well on other categories. For example, in the detection accuracy of targets such as cars, buses and bicycles, this paper’s model achieves 80.9%, 62.9% and 66.5%, respectively, which is higher than YOLOV5’s 80.3%, 56% and 46.6%. It shows that this paper’s model has better detection performance for different traffic targets. Finally, the detection accuracy of this paper’s model is also higher than that of YOLOV5 for trucks and motorcycles, which are 60.3% and 75.2%, respectively. In summary, the model in this paper performs well in the detection accuracy of multiple categories, which is especially suitable for the target detection task in complex scenes.

Detection accuracy for each category of different models

Model Pedestrian Rider Car Bus Truck Bicycle Motorcycle
YOLOV5 44.5% 59.1% 80.3% 56% 56.9% 46.6% 66.5%
Model of this paper 53.1% 73.5% 80.9% 62.9% 60.3% 66.5% 75.2%

Comparison of mainstream target detection models

To verify the effectiveness of the detection model, the proposed target detection model with receptive field amplification is compared with mainstream deep learning networks on the HydroFogRain dataset, including Faster R-CNN, SSD, YOLOV3, YOLOV3-SPP, YOLOV4, YOLOV7, YOLOV8 and DETR; the comparative experiments are shown in Table 9. The table clearly shows that the proposed model has a much higher mAP than the other mainstream detection models, almost twice that of SSD. Although Faster R-CNN, as a representative two-stage detection algorithm, offers relatively good accuracy, it is still far below that of this paper’s model and is not suitable for real-time applications. In contrast, YOLOV3, YOLOV3-SPP and YOLOV4 have faster detection speeds but lower detection accuracy and poor overall results. Compared with YOLOV5X, the model in this paper performs well in terms of precision and recall while using far fewer parameters. Compared with YOLOV7, YOLOV8 and DETR, the model in this paper shows the best performance in terms of precision, recall and detection speed, although its parameter count is not the smallest. Taken together, the target detection algorithm proposed in this study achieves the best detection precision, and although its detection speed is slightly lower than some YOLO series models, it is still the best choice in terms of overall performance.

Comparison experiment

Model P R mAP Parameter FPS
Faster R-CNN 50.2% 59.6% 57.9% 62M 15.5
SSD 33.2% 44.3% 38.8% 68.6M 31.3
YOLOV3 75.8% 52.2% 59.5% 61.4M 49.7
YOLOV3_SPP 73.5% 46.6% 54.3% 64.5M 43.8
YOLOV4 69.5% 45.9% 48.8% 63.9M 50.5
YOLOV5X 78.9% 53.7% 57.8% 85.8M 27.2
YOLOV7 75.9% 53.5% 59.4% 34.4M 42.2
YOLOV8 70.1% 51.6% 59.6% 41.8M 34.1
DETR 61.8% 44% 46.7% 31.7M 28.8
YOLO-Z 71.5% 47.7% 52.8% 55.6M 28.2
Ours 77.1% 52.4% 68.2% 53.2M 35.5
Vehicle color detection results

To evaluate vehicle color recognition performance under rainy conditions, quantitative evaluations and qualitative comparisons were performed between the method in this paper and Da-Faster, SA-Da-Faster and SMNN-MSFF. To ensure fair comparison, all settings follow the original papers. The method in this paper, Da-Faster and SA-Da-Faster train their network models on the labeled source-domain dataset Vehicle Color-24 and the unlabeled target-domain dataset Rain Vehicle Color-24, whose training sets contain 8094 and 8194 images respectively; SMNN-MSFF is trained on the training set of Rain Vehicle Color-24. After training, all models are evaluated on the 576 test images of Rain Vehicle Color-24, and the per-category detection accuracy of the different algorithms is shown in Table 10. The table reports the AP value for each color category and, finally, the mean AP over all color categories. The experimental results show that the method proposed in this paper has the highest mAP compared with other state-of-the-art unsupervised domain-adaptive target detection methods and vehicle color recognition algorithms, exceeding Da-Faster, SA-Da-Faster and SMNN-MSFF by 3.73%, 2.23% and 1.19%, respectively, and can effectively improve the accuracy of vehicle color recognition under rainy conditions. Overall, the method in this paper reduces the domain gap of the model in the target domain and improves localization accuracy.

Detection accuracy

Algorithm Da-Faster SA-Da-Faster SMNN-MSFF Ours
White 0.78 0.63 0.64 0.77
Black 0.8 0.52 0.65 0.73
Orange 0.8 0.83 0.75 0.77
Silver grey 0.85 0.27 0.34 0.54
Grass green 0.69 0.8 0.84 0.82
Deep grey 0.73 0.27 0.45 0.54
Scarlet 0.77 0.65 0.49 0.8
Gray 0.17 0.04 0.25 0.12
Red 0.59 0.65 0.46 0.61
Green color 0.76 0.85 0.6 0.75
champagne 0.57 0.17 0.33 0.28
Dark blue 0.68 0.39 0.54 0.55
Blue 0.72 0.58 0.59 0.74
Dark brown 0.45 0.07 0.38 0.27
Brown 0.27 0.38 0.32 0.21
Yellow 0.52 0.66 0.31 0.31
Lemon yellow 0.87 0.96 0.57 0.43
Dark orange 0.61 0.63 0.34 0.99
Dark green 0.37 0.3 0.57 0.09
Salmon 0.27 0.33 0.37 0.38
Earth yellow 0.64 0.47 0.66 0.09
Green 0.61 0.08 0.17 0.73
Pink 0.55 0.7 0.91 0.58
Purple 0.00 0.00 0.22 0.00
Mean accuracy(%) 46.15 47.65 48.69 49.88
Conclusion

Overcoming the influence of bad weather on image quality and ensuring the accuracy of the detection system under various weather conditions is of great significance for improving the safety of automatic driving and the reliability of intelligent transportation systems. This paper studies vehicle target detection methods for rainy and foggy scenes, with the following experimental results:

This paper designs a vehicle target detection technique based on a GAN with dynamic fuzzy compensation and evaluates its performance under various data strategies. In the quantitative evaluation on the single-type dataset (rain streak), the PSNR of this paper’s algorithm on RS_syn is 5 dB higher than that of CCN. A large number of experiments demonstrate that this paper’s method outperforms other state-of-the-art de-raining methods.

In the experiments on the effectiveness of data derivation for rain and fog scenes, the results show that the proposed model achieves 75.6%, 54.8% and 67.1% in precision, recall and mAP, respectively, which is an excellent result, and it can accurately detect vehicles of different colors.