
Vehicle Target Detection in Rainy and Foggy Scenes Based on Generative Adversarial Networks and Dynamic Fuzzy Compensation Techniques

29 September 2025

Introduction

Weather is an important factor affecting traffic accidents, and adverse weather is a major cause of them [1]. In typical adverse conditions such as fog and rain, low visibility and insufficient sight distance prevent the driver from clearly identifying the direction of travel, and the handling and stability of the vehicle are degraded [2-3]. Compared with accidents on sunny days, traffic accidents caused by adverse weather have more serious consequences, with a casualty rate roughly 10 times that under normal weather conditions [4].

How to reduce the incidence of traffic accidents and improve the safety of road transportation is a global problem deserving common attention, and a great challenge for researchers and practitioners around the world [5-6]. To reduce traffic accidents and improve traffic efficiency, automobiles are gradually developing toward intelligence, lightweight design, network connectivity, electrification, and shared mobility [7]. With the development of intelligent transportation and smart cities, intelligent driving cars and intelligent driver assistance systems have attracted great attention [8]. As a complex system integrating perception, cognition, planning and control, the intelligent driving car benefits from the increasing maturity of machine vision, sensor technology, artificial intelligence, automatic control and other related technologies [9-11]. With good application prospects and a broad potential market, intelligent driving cars have received funding and research support from many countries and have achieved many technological breakthroughs [12-13].

Environment sensing technology for road traffic is a crucial link for intelligent vehicles to realize intelligence, and a basic guarantee for the safety and intelligence of intelligent transportation [14]. Environment sensing serves as the “eyes” of the intelligent vehicle, which perceives its surroundings while driving through visual sensors, the most common of which are cameras [15-16]. Compared with active ranging sensors, cameras provide dense pixel information about the surrounding scene at relatively low cost, improving the accuracy of perceiving object categories and shapes, and are widely used in image processing and target detection [17-18]. Vehicles are one of the main elements of the road traffic scene, and vehicle detection refers to accurately identifying and localizing vehicle targets in the acquired visual data [19]. Vehicle detection algorithms are important components of visual perception systems and are widely used in video surveillance, intelligent transportation and other fields [20]. Target detection is an important branch of image processing and computer vision, and one of the most important and indispensable tasks in automated driving: detecting targets quickly and accurately not only supports precise navigation but also helps handle potential hazards in complex driving environments. It is also a prerequisite for other processing in ITS, such as vehicle monitoring, vehicle type recognition, traffic flow statistics and license plate recognition [21-22].

In recent years, a large number of detection models based on deep convolutional neural networks have been introduced to improve target detection performance [23]. Although these detectors achieve good accuracy under favorable weather, adverse weather such as rain, snow, fog, sand and dust is common in real life. Images acquired under these conditions are degraded and blurred, making it difficult to extract effective features, so the detection accuracy of these detectors drops sharply. It is estimated that almost 30% of all traffic accidents are caused by rainy and foggy conditions, roughly 70% higher than under clear weather, because rain, snow and fog interfere with the driver’s vision and lead to misjudgment of road conditions [24-27]. Rainy and foggy weather has therefore become an important factor affecting road safety, and since it cannot be eliminated artificially, improving the accuracy of vehicle target detection in rain and fog is extremely important for intelligent transportation management [28].

A dynamic blurred image processing method based on the Wiener filter and a generative adversarial network is first proposed. The Wiener filter deblurring algorithm removes noise by minimizing the mean square error. A generative adversarial network (GAN) model, which does not rely on a predefined data distribution, is then considered, and a UNIT-based de-fogging and de-raining algorithm is proposed that introduces an encoder-decoder architecture to capture more useful information. The loss function is further modified to generate realistic and clear images. On this basis, an image de-raining assisted, locally perception-enhanced vehicle detection model is constructed, with Swin Transformer chosen as the design prototype of the backbone network; the whole network is divided into two parts, image de-raining/de-fogging and target detection. Finally, the performance of the proposed method is verified by experiments on both simulated data and real scenes.

Method
Image processing based on Wiener filter and generative adversarial network
Wiener filter deblurring algorithm

Filter-based deblurring estimates a desirable clear image $u$ from an observed blurred image $u_0$ by a filter estimator, with $w$ denoting the corresponding filter, as represented in equation (1): $\hat{u} = \hat{u}_w = w \times u_0$

For a pure denoising process on a noisy image unaffected by blurring, linear filtering is a natural tool for noise suppression by convolution; for deblurring, it can be viewed as an attempt to undo the effect of one convolution operation by another convolution operation [29]. For example, ignoring noise, the filter can be written in the Fourier frequency domain, with the Fourier transform of the kernel denoted $K(\omega)$ and the frequency variable denoted $\omega = (\omega_1, \omega_2)$: $W(\omega) = \dfrac{1}{K(\omega)}$

With this filter, $\hat{u} = w \times u_0 = u$ is recovered exactly for any clear image $u$. But a typical blurring kernel $K$ is a low-pass filter whose Fourier transform $K(\omega)$ decays rapidly in the high-frequency part, which makes the process unstable and can leave the recovered image seriously distorted.

In order to resolve this instability in the recovery process, the above equation is improved to: $W(\omega) = \dfrac{K^*(\omega)}{K(\omega)K^*(\omega)} = \dfrac{K^*}{|K|^2}$

where $^*$ denotes the complex conjugate, and the instability of the denominator in the high-frequency part is regularized by adding a positive factor $r = r(\omega)$: $W \rightarrow W_r = \dfrac{K^*}{|K|^2 + r}$

Denoting the resulting estimate by $\hat{u}_r$, we have $\hat{u}_r = w_r \times k \times u$

or in the Fourier frequency domain: $W_r K(\omega) = \dfrac{|K(\omega)|^2}{|K(\omega)|^2 + r(\omega)}$

In the low-frequency part, where $r \ll |K|^2$, the recovered image is approximately the expected one. Where $r \gg |K|^2$, in the high-frequency part, the result is severely distorted because $K$ nearly vanishes. The regularization factor $r$ thus acts as a threshold.

Since image noise affects deblurring, selecting an optimal regularization factor is particularly important; for this, Wiener’s minimum mean square error criterion is used in the original implementation.

Generating Adversarial Network Models (GAN)

A Generative Adversarial Network (GAN) is a network containing a generator and a discriminator. The generator produces samples that “look like real samples” from an input noise signal, and the discriminator is used to distinguish the generated samples from real ones [30]. Taking photo generation as an example (G is the generator and D is the discriminator):

G receives a random noise x and generates a photo from it, denoted G(x).

D determines whether a photo is “real”. Its input is y, a photo, and its output D(y) is the probability that y is a real photo: an output of 1 means the photo is real, and an output of 0 means it is not.

Throughout training, the generator tries to generate realistic photos to deceive the discriminator, while the discriminator tries to distinguish real photos from generated ones, forming a dynamic “game process”.

As a recent form of machine learning, generative adversarial networks offer, compared with general neural networks, high-resolution and sharp output images and generality in the choice of generator and discriminator. Compared with other generative models, GANs no longer require a predefined data distribution and have maximum freedom of fit.
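As an illustration of the adversarial game described above, the following is a minimal PyTorch sketch (not the networks used in this paper); the layer sizes, the 100-dimensional noise input and the flattened 784-pixel images are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the adversarial game: G maps noise x to a fake sample G(x),
# D outputs the probability that its input is real. Sizes are illustrative only.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real):                      # real: batch of flattened real images
    b = real.size(0)
    fake = G(torch.randn(b, 100))

    # Discriminator tries to assign 1 to real samples and 0 to generated ones.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    loss_d.backward()
    opt_d.step()

    # Generator tries to make D label its samples as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(b, 1))
    loss_g.backward()
    opt_g.step()
```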

Design of the Wiener filtering algorithm

Wiener filtering assumes that the deblurred estimate $\hat{u}$ is obtained by filtering the blurred image $u_0$ with an optimal filter $w(X)$: $\hat{u}_w = w \times u_0$

The Wiener filter $w(X)$ is the filter that minimizes the mean square of the estimation error $e_w(X) = \hat{u}_w(X) - u(X)$, i.e. the optimal filter:

$w = \arg\min_h E\big[e_h^2\big] = \arg\min_h E\big[(h \times u_0(X) - u(X))^2\big]$

which yields the orthogonality condition: $E\big[(w \times u_0(X) - u(X))\,u_0(Y)\big] = 0, \quad \forall X, Y \in \Omega$

which can be rewritten in terms of correlation functions: $w \times R_{u_0 u_0}(Z) = R_{u u_0}(Z), \quad Z \in \mathbb{R}^2$

This leads to the explicit form of the optimal Wiener filter in the frequency domain: $W(\omega) = \dfrac{S_{u u_0}(\omega)}{S_{u_0 u_0}(\omega)}$, where $S$ denotes the power spectrum (the Fourier transform of the corresponding correlation function).

For the blurred image $u_0 = k \times u + n$, $S_{u u_0} = K^*(\omega) S_{uu}$ and $S_{u_0 u_0} = |K|^2 S_{uu} + S_{nn}$, so that: $W(\omega) = \dfrac{K^*(\omega) S_{uu}}{|K|^2 S_{uu} + S_{nn}} = \dfrac{K^*}{|K|^2 + r}$

where the regularization factor $r = S_{nn}/S_{uu}$ is the noise-to-signal power ratio (the square of the noise-to-signal amplitude ratio).
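The frequency-domain form of this regularized Wiener filter can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the blur kernel k and a scalar estimate of the noise-to-signal ratio r are known; it is not the full implementation used in this paper.

```python
import numpy as np

def wiener_deblur(u0, k, nsr):
    """Frequency-domain Wiener deconvolution (sketch).

    u0  : observed blurred (and noisy) grayscale image, 2-D array
    k   : blur kernel (PSF), assumed known, smaller than u0
    nsr : scalar estimate of the noise-to-signal power ratio r = S_nn / S_uu
    """
    K = np.fft.fft2(k, s=u0.shape)            # kernel zero-padded to the image size
    U0 = np.fft.fft2(u0)
    Wr = np.conj(K) / (np.abs(K) ** 2 + nsr)  # regularized filter W_r = K* / (|K|^2 + r)
    return np.real(np.fft.ifft2(Wr * U0))     # estimated clear image (up to circular shift)
```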

UNIT-based de-fogging and de-rain algorithm
UNIT framework

The UNIT structure is shown schematically in Fig. 1. The UNIT network can be viewed formally as a combination of two VAE/GAN models. The network consists of three main parts: the encoders E1 and E2, the generative models G1 and G2, and the discriminative models D1 and D2, with x1 and x2 representing images in the source and target domains, respectively. For the unsupervised image style conversion task, samples can only be drawn from the respective marginal distributions. However, infinitely many joint distributions are consistent with given marginal distributions, so the joint distribution cannot be recovered from the marginals without further assumptions. Based on the theory of a “shared latent space”, it is therefore assumed that any inputs x1 and x2 have a common latent code z in a shared latent space, from which the image of the corresponding domain can be generated or the original image restored.

Figure 1.

Structure of the UNIT

As shown in (a), there are two picture domains with different styles in UNIT, the source domain X1 and the target domain X2. In the UNIT network model, encoders E1 and E2 map the input pictures x1 and x2 from the different domains to a shared latent space and encode them into a common latent code z, with z = E1(x1) = E2(x2). The generative models G1 and G2 convert the latent code z into pictures of the corresponding domains. For generative model G1, whose input is the latent code z, the latent code obtained from domain X1 can be mapped back into the source domain X1, and the latent code obtained from domain X2 can also be mapped into the source domain X1. Like the discriminative models in the original GAN, D1 and D2 judge the authenticity of images, i.e., whether an input image is a real sample or a sample generated by the generative model.

As shown in (b), picture x1 in domain X1 is encoded by E1 into the latent code z, which can be mapped back to domain X1 by generative model G1 to obtain the self-reconstructed image $x_1^{1\to1}$, or passed through generative model G2 to obtain the domain-translated image $x_1^{1\to2}$. Similarly, picture x2 in domain X2 is encoded by E2 into the latent code z, from which generative model G1 produces the domain-translated image $x_2^{2\to1}$ and generative model G2 produces the self-reconstructed image $x_2^{2\to2}$.

UNIT-based de-fogging and de-rain algorithm

Network Structure

In this section, the proposed VAE-CoGAN de-fogging and de-raining model handles images of foggy and rainy scenes without pre-classifying them. VAE-CoGAN consists of three parts: an encoder, a generative model, and a discriminative model.

The encoder converts the input pictures into vector form to be used as input to the generative models of the GAN. For a blurred picture x1 with fog or rain streaks in the source domain and a clear picture x2 in the target domain, the same latent code z is obtained by mapping the pictures (x1, x2) to the shared latent space through encoders E1 and E2, with z = E1(x1) = E2(x2).

The generative model converts the latent code z produced by the encoder back into a picture. Generative model $G_1$ yields the generated pictures $x_1^{1\to1}$ and $x_2^{2\to1}$, where $x_2^{2\to1}$ is a blurred picture converted from a clear picture in the target domain $X_2$, while $x_1^{1\to1}$ is a self-reconstructed picture without style conversion, which does not help the image style conversion task studied in this chapter. Similarly, generative model $G_2$, whose input is the latent code z, produces the generated images $x_1^{1\to2}$ and $x_2^{2\to2}$, where $x_1^{1\to2}$ is a clear picture transformed from a blurred image in the source domain $X_1$, i.e. the clear picture corresponding to the foggy or rainy picture that this chapter aims to obtain, and $x_2^{2\to2}$ is a self-reconstructed picture without style conversion, which likewise does not help the style conversion task.
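The four translation paths described above (two self-reconstructions and two cross-domain translations through the shared latent code) can be sketched schematically in PyTorch as follows; the encoder and generator architectures shown are simple placeholders, not the actual VAE-CoGAN networks.

```python
import torch
import torch.nn as nn

# Placeholder encoder/generator pair; the real architectures are deeper.
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
    def forward(self, x):
        return self.net(x)                  # latent code z

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())
    def forward(self, z):
        return self.net(z)

E1, E2, G1, G2 = Encoder(), Encoder(), Generator(), Generator()

x1 = torch.randn(1, 3, 256, 256)            # rainy/foggy image (source domain X1)
x2 = torch.randn(1, 3, 256, 256)            # clear image (target domain X2)

z1, z2 = E1(x1), E2(x2)                     # codes in the shared latent space
x1_to_1 = G1(z1)                            # self-reconstruction of x1
x1_to_2 = G2(z1)                            # de-rained/de-fogged version of x1
x2_to_1 = G1(z2)                            # degraded version of x2
x2_to_2 = G2(z2)                            # self-reconstruction of x2
```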

Loss function

The objective function of this network consists of four components: the VAE loss, the GAN loss, the cycle-consistency loss, and the VGG perceptual loss. The objective function is as follows: $$\mathcal{L}(E_1,E_2,G_1,G_2,D_1,D_2) = \mathcal{L}_{VAE_1}(E_1,G_1) + \mathcal{L}_{VAE_2}(E_2,G_2) + \mathcal{L}_{GAN_1}(E_2,G_1,D_1) + \mathcal{L}_{GAN_2}(E_1,G_2,D_2) + \mathcal{L}_{CC_1}(E_1,G_1,E_2,G_2) + \mathcal{L}_{CC_2}(E_2,G_2,E_1,G_1) + \mathcal{L}_{content}(E_1,E_2,G_1,G_2)$$

The VAE loss aims to minimize the objective function and consists of two components, a regularization term and a reconstruction error term: $\mathcal{L}_{VAE} = \mathcal{L}_{prior} + \mathcal{L}_{like}^{pixel}$,

where $\mathcal{L}_{prior} = KL\big(q(z|x)\,\|\,p_\eta(z)\big)$ and $\mathcal{L}_{like}^{pixel} = -\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]$.

Regularization provides a simple way to sample from the latent space, and minimizing the negative log-likelihood term in the reconstruction error is equivalent to minimizing the absolute distance between the image and the reconstructed image. This chapter uses two variational autoencoders with: $$\mathcal{L}_{VAE_1}(E_1,G_1) = \lambda_4 KL\big(q_1(z_1|x_1)\,\|\,p_\eta(z)\big) - \lambda_5 \mathbb{E}_{z_1 \sim q_1(z_1|x_1)}\big[\log p_{G_1}(x_1|z_1)\big]$$ $$\mathcal{L}_{VAE_2}(E_2,G_2) = \lambda_4 KL\big(q_2(z_2|x_2)\,\|\,p_\eta(z)\big) - \lambda_5 \mathbb{E}_{z_2 \sim q_2(z_2|x_2)}\big[\log p_{G_2}(x_2|z_2)\big]$$

where λ4 and λ5 are hyperparameters controlling the weights of the two terms. The KL divergence term measures the distance between $q(z|x)$ and $p_\eta(z)$; the smaller the KL value, the smaller the distance.

In this model, the GAN loss function is used to ensure that the generated images are as similar as possible to images in the target domain: $$\mathcal{L}_{GAN_1}(E_2,G_1,D_1) = \lambda_0 \mathbb{E}_{x_1 \sim p_{x_1}}\big[\log D_1(x_1)\big] + \lambda_0 \mathbb{E}_{z_2 \sim q_2(z_2|x_2)}\big[\log\big(1 - D_1(G_1(z_2))\big)\big]$$ $$\mathcal{L}_{GAN_2}(E_1,G_2,D_2) = \lambda_0 \mathbb{E}_{x_2 \sim p_{x_2}}\big[\log D_2(x_2)\big] + \lambda_0 \mathbb{E}_{z_1 \sim q_1(z_1|x_1)}\big[\log\big(1 - D_2(G_2(z_1))\big)\big]$$

Consider the style-translated pictures produced by generative models $G_1$ and $G_2$, namely $x_2^{2\to1} = F^{2\to1}(x_2) = G_1(E_2(x_2))$ and $x_1^{1\to2} = F^{1\to2}(x_1) = G_2(E_1(x_1))$. To make the results more realistic, the idea of cycle consistency is incorporated: $x_1 = F^{2\to1}(F^{1\to2}(x_1))$ and $x_2 = F^{1\to2}(F^{2\to1}(x_2))$. The cycle-consistency losses are: $$\mathcal{L}_{CC_1}(E_1,G_1,E_2,G_2) = \lambda_6 KL\big(q_1(z_1|x_1)\,\|\,p_\eta(z)\big) + \lambda_6 KL\big(q_2(z_2|x_1^{1\to2})\,\|\,p_\eta(z)\big) - \lambda_7 \mathbb{E}_{z_2 \sim q_2(z_2|x_1^{1\to2})}\big[\log p_{G_1}(x_1|z_2)\big]$$ $$\mathcal{L}_{CC_2}(E_2,G_2,E_1,G_1) = \lambda_6 KL\big(q_2(z_2|x_2)\,\|\,p_\eta(z)\big) + \lambda_6 KL\big(q_1(z_1|x_2^{2\to1})\,\|\,p_\eta(z)\big) - \lambda_7 \mathbb{E}_{z_1 \sim q_1(z_1|x_2^{2\to1})}\big[\log p_{G_2}(x_2|z_1)\big]$$

The formula for the VGG loss is: $$\mathcal{L}_{content}(E_1,E_2,G_1,G_2) = \mathbb{E}_{x_1 \sim p_{x_1}}\big[\|\varphi_i(G_1(E_2(G_2(E_1(x_1))))) - \varphi_i(x_1)\|_1\big] + \mathbb{E}_{x_2 \sim p_{x_2}}\big[\|\varphi_i(G_2(E_1(G_1(E_2(x_2))))) - \varphi_i(x_2)\|_1\big]$$

where φi is the activation of layer i of the CNN.

The final objective of the network is: $$G^*, F^* = \arg\min_{G,F}\ \max_{D_{X_1},D_{X_2},D_{Y_1},D_{Y_2}} \mathcal{L}(G,F,D_{X_1},D_{X_2},D_{Y_1},D_{Y_2})$$

Here λ0 = 10, λ4 = 0.1, λ5 = 100, λ6 = 0.1, λ7 = 100.
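A schematic sketch of how these weighted loss terms could be assembled is given below. The helper functions are simplified stand-ins (e.g. a unit-variance posterior for the KL term and L1 reconstruction, following the absolute-distance interpretation above); they are not the exact implementation used in this paper.

```python
import torch
import torch.nn.functional as F

# Weights as specified above; each helper stands in for one loss component.
l0, l4, l5, l6, l7 = 10.0, 0.1, 100.0, 0.1, 100.0

def kl_to_standard_normal(mu):
    # KL(N(mu, I) || N(0, I)) for a unit-variance posterior (simplifying assumption).
    return 0.5 * torch.mean(mu ** 2)

def vae_loss(mu, x, x_recon):
    # Prior term plus L1 reconstruction (negative log-likelihood ~ absolute distance).
    return l4 * kl_to_standard_normal(mu) + l5 * F.l1_loss(x_recon, x)

def gan_loss_d(d_real, d_fake):
    # Discriminator side of the adversarial loss, weighted by lambda_0.
    return l0 * (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
                 F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

def cycle_loss(mu_src, mu_trans, x, x_cycled):
    # Two KL terms on the source and translated codes plus cycle reconstruction.
    return (l6 * kl_to_standard_normal(mu_src) +
            l6 * kl_to_standard_normal(mu_trans) +
            l7 * F.l1_loss(x_cycled, x))

def perceptual_loss(phi_x, phi_x_cycled):
    # phi_* are feature maps from a fixed VGG layer i.
    return F.l1_loss(phi_x_cycled, phi_x)
```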

Image De-Rain Assisted Local Perception Enhanced Vehicle Detection Model
Detection network structure

This section focuses on the network structure of the vehicle detection part and proposes a detection framework with local perceptual enhancement, in contrast to the most commonly used CNN-based detection algorithms. Unlike the detection backbone used for foggy images, the model feeds the features extracted by the de-raining part into the locally perception-enhanced Transformer backbone; the vehicle detection network is shown in Fig. 2. As in Swin Transformer, the four stages contain 2, 2, 6 and 2 blocks, respectively.

Figure 2.

Vehicle detection network

First, given a rainy image of size H × W × 3, the input is partitioned into a set of non-overlapping image patches by patch partitioning, where each patch has size 4 × 4, feature dimension 4 × 4 × 3, and the number of patches is $\frac{H}{4} \times \frac{W}{4}$. Each patch is then treated as a vector “token” through a linear embedding, with its features viewed as a concatenation of the original pixel RGB values. After the feature dimension of each patch token is changed to C, the tokens are fed into multiple locally aware Swin Transformer blocks for encoding, with different numbers of blocks at different stages. Neighboring image patches are then merged 2 × 2 by patch merging, which changes the number of patches to $\frac{H}{8} \times \frac{W}{8}$ and the feature dimension to 4C. This step is repeated at each stage until the number of patches becomes $\frac{H}{32} \times \frac{W}{32}$ and the feature dimension becomes 8C. Finally the features are fed to the regression head for target classification and localization regression. Each stage consists of a patch merging block or a linear embedding block followed by stacked local perceptual enhancement Transformer blocks.
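The 4 × 4 patch partition and linear embedding step can be sketched with a strided convolution, which is a common way to implement it; the embedding dimension C = 96 and the input size are illustrative assumptions, not values stated by the paper.

```python
import torch
import torch.nn as nn

# Patch partition + linear embedding: each non-overlapping 4x4 patch
# (4*4*3 = 48 raw values) is projected to a C-dimensional token.
C = 96                                              # embedding dimension (illustrative)
patch_embed = nn.Conv2d(3, C, kernel_size=4, stride=4)

x = torch.randn(1, 3, 640, 640)                     # rainy input image
tokens = patch_embed(x)                             # (1, C, H/4, W/4)
tokens = tokens.flatten(2).transpose(1, 2)          # (1, H/4 * W/4, C) token sequence
print(tokens.shape)                                 # torch.Size([1, 25600, 96])
```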

Local Perception Enhancement Transformer Block (LPST)

The positional encoding in the Transformer easily fails to capture local correlation and structural information in the image; Swin Transformer uses a window-based hierarchical structure to address the scaling problem and high computational complexity of high-resolution images. Each Swin Transformer block consists of a normalization layer, a multi-head self-attention module, residual connections, and a multilayer perceptron (MLP) with two fully connected layers and GELU nonlinearity. The window-based multi-head self-attention (W-MSA) module and the shifted-window-based multi-head self-attention (SW-MSA) module are applied in two consecutive Transformer blocks, respectively.

Although Swin Transformer constructs a hierarchical Transformer and performs attention within each non-overlapping window, it is limited in its ability to encode contextual information. To enhance the network’s learning of local relevance and structural information, this paper proposes a locally perception-enhanced Transformer, in which each block is composed of two consecutive improved Transformer blocks.

The dilation of a standard convolution kernel can be viewed as spacing out the values of the kernel during data processing. Dilated convolution introduces a hyperparameter, the dilation rate r, to the convolutional layer; ordinary convolution has r = 1, i.e. dilated convolution with r = 1 is standard convolution. The dilation process can be seen as inserting zeros between horizontally and vertically adjacent kernel weights: there are (k − 1) gaps, and r − 1 zeros are inserted in each gap. The inserted zeros do not participate in the convolution operation. The size of the dilated convolution kernel is therefore given by equation (23): $k_d = k + (k-1) \times (r-1)$

where kd represents the size of the dilation convolution kernel, k represents the size of the standard convolution kernel, and r represents the dilation rate. In particular, when r is set to 1, there is no null parameter, i.e., it is a standard convolution. Usually, the dilation convolution which has the same parameters as the standard convolution has more weight parameters and larger size to capture the global information in the image through the enlarged sensory field and discretized weight parameters [31].

Further, after the introduction of dilated convolution, the size of the corresponding output feature map is calculated as in equation (25): $$o = \left\lfloor \frac{i + 2p - k - (k-1)(r-1)}{s} \right\rfloor + 1$$

where o is the output feature map size, i is the size of the input feature map, p is the boundary padding of the input feature map, and s is the stride. The formula shows that if the kernel is dilated while the stride, boundary padding and other parameters remain unchanged, then for the same input feature map the output feature map after convolution becomes correspondingly smaller.
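A small numerical check of equations (23) and (25) in PyTorch, using illustrative values for k, r, i, p and s:

```python
import torch
import torch.nn as nn

k, r = 3, 2                                        # kernel size and dilation rate (illustrative)
k_d = k + (k - 1) * (r - 1)                        # effective kernel size: 3 + 2*1 = 5

i, p, s = 32, 0, 1                                 # input size, padding, stride
o = (i + 2 * p - k - (k - 1) * (r - 1)) // s + 1   # expected output size: 28

conv = nn.Conv2d(1, 1, kernel_size=k, dilation=r, padding=p, stride=s)
y = conv(torch.randn(1, 1, i, i))
print(k_d, o, y.shape)                             # 5  28  torch.Size([1, 1, 28, 28])
```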

Results and discussion
Experiments on robust rain removal methods for multiple rain degradation types
Experimental setup details

Comparison methods

The algorithm in this paper is compared with three different classes of rain removal methods: (1) one raindrop removal method, AttentGAN; (2) six rain streak removal methods, DetailNet, RESCAN, PReNet, JORDER-E, RCDNet and RLNet; (3) two robust rain removal methods, Pix2pix and CCN. In addition to the algorithm in this paper, the RadNet algorithm is also tested.

Datasets

Four datasets are selected for training and testing: Rain200H, Rain200L, RainDrop and RainDS. RainDS contains three synthetic subsets (RS_syn, RD_syn and RDS_syn) and three real subsets (RS_real, RD_real and RDS_real). Based on these datasets, three data strategies are designed to test the robustness of the rain removal methods: (1) Single-type data strategy, i.e., a single dataset in which each image contains only one type of rain degradation; the eligible datasets include the single rain streak datasets (Rain200H, Rain200L, RS_syn and RS_real) and the single raindrop datasets (RainDrop, RD_syn and RD_real). (2) Stacked data strategy, i.e., a single dataset in which each image contains two rain degradation types; the eligible datasets are RDS_syn and RDS_real. (3) Hybrid data strategy, i.e., multiple datasets mixed together, where each image may contain a single degradation type or multiple types; three hybrid datasets are constructed: Blended-1 = {RD_syn + RS_syn + RDS_syn}, Blended-2 = {RD_real + RS_real + RDS_real} and Blended-3 = {Rain200H + Rain200L + RainDrop}. In addition, real scene data were collected from the Internet and previous work to construct real benchmarks for examining the de-raining ability of each method in real scenes. Peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) are used to evaluate paired data. For unlabeled data, comparisons are made based on visualization results.
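For paired data, PSNR and SSIM could be computed per image pair as in the following sketch using scikit-image; this is an assumption about tooling, as the paper does not specify its evaluation code.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored, ground_truth):
    """PSNR/SSIM for one restored image against its ground truth (uint8 RGB arrays)."""
    psnr = peak_signal_noise_ratio(ground_truth, restored, data_range=255)
    # channel_axis requires scikit-image >= 0.19 (older versions use multichannel=True).
    ssim = structural_similarity(ground_truth, restored, data_range=255, channel_axis=-1)
    return psnr, ssim
```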

Training details

The algorithms in this paper were trained on two NVIDIA GeForce RTX 3080 GPUs with 24 GB of memory, using the PyTorch deep learning framework in a Python environment. Adam was chosen as the optimizer, with weight decay and momentum set to 0.0001 and 0.9, respectively. The initial learning rate is set to 1e-3 for the RAM and DRM and 1e-6 for the FWM, and the learning rate is multiplied by 0.2 every 30 epochs. Each image is randomly cropped to 128 × 128 pixels. The network is trained for 100 epochs to convergence with a batch size of 16.
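A sketch of this optimizer and schedule in PyTorch is shown below; the RAM/DRM and FWM modules are placeholders (the real architectures are not specified here), and the paper's "momentum 0.9" is interpreted as Adam's beta1 = 0.9.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the RAM/DRM and FWM parts of the network.
ram_drm = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
fwm = nn.Sequential(nn.Conv2d(16, 3, 3, padding=1))

# Adam with weight decay 1e-4; different initial learning rates per parameter group.
optimizer = torch.optim.Adam(
    [{"params": ram_drm.parameters(), "lr": 1e-3},   # RAM and DRM branches
     {"params": fwm.parameters(), "lr": 1e-6}],      # FWM branch
    betas=(0.9, 0.999), weight_decay=1e-4)

# Learning rate multiplied by 0.2 every 30 epochs, over 100 training epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.2)
```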

Simulation experiment results and analysis

Experimental results and analysis on single-type data

First, the performance of all methods is tested under the single-type strategy; the quantitative evaluation results on the single-type dataset (rain streak) are shown in Table 1 and on the single-type dataset (raindrop) in Table 2. It can be seen that: (1) the performance of this paper’s algorithm is much better than the other methods on the two real datasets, RS_real and RD_real; (2) compared with CCN, which is also a robust method, this paper’s algorithm performs better on all datasets except RainDrop; specifically, the PSNR of this paper’s algorithm on RS_syn is 5 dB higher than that of CCN; (3) the poorer performance on the RainDrop dataset relative to CCN may be attributed to the fact that CCN has two independent modules that process rain streaks and raindrops separately, so it can deal with RainDrop data containing large blurred regions; (4) in terms of average score, the algorithm in this paper is highly competitive.

Quantitative evaluation results on single-type dataset (rain streak)

Method Rain200H Rain200L RS_syn RS_real Average (PSNR/SSIM)
AttentGAN 22.98/0.73 28.29/0.885 27.53/0.658 24.42/0.425 25.81/0.675
DetailNet 26.34/0.835 34.41/0.869 30.86/0.686 26.18/0.84 29.45/0.808
RESCAN 26.71/0.114 37.02/0.723 38.64/0.982 26.36/0.938 32.18/0.689
PReNet 28.17/0.869 36.82/0.951 39.37/0.969 25.93/0.998 32.57/0.947
JORDER-E 29.51/0.741 39.29/0.975 40.16/0.737 26.31/0.603 33.82/0.764
RCDNet 30.77/0.775 39.79/0.286 44.16/0.833 27.24/0.991 35.49/0.721
RLNet 29.47/0.787 38.36/0.992 37.06/0.966 26.85/0.755 32.94/0.875
Pix2pix 24.1/0.496 29.78/0.943 28.06/0.815 24.9/0.937 26.71/0.798
CCN 28.98/0.722 37.86/0.864 35.1/0.904 26.81/0.901 32.19/0.848
RadNet 30.38/0.995 38.74/0.985 39.17/0.891 26.67/0.773 33.74/0.911
Ours 30.23/0.985 38.56/0.855 39.57/0.201 27.69/0.929 34.01/0.743

Quantitative assessment results on single-type data sets (raindrops)

Method RainDrop RD_syn RD_real Average (PSNR/SSIM)
AttentGAN 30.6/0.932 27.26/0.87 21.75/0.669 26.54/0.824
DetailNet 25.02/0.594 28.42/1.262 22.13/0.821 25.19/0.892
RESCAN 25.54/0.953 34.45/0.885 23.03/0.584 27.67/0.807
PReNet 25.6/0.526 34.92/0.8 23.66/0.465 28.06/0.597
JORDER-E 26.62/0.989 35.55/0.832 23.83/0.918 28.67/0.913
RCDNet 26.28/0.887 35.18/0.991 24.36/0.837 28.61/0.905
RLNet 26.6/0.518 33.28/0.921 23.85/0.726 27.91/0.722
Pix2pix 25.55/0.935 25.07/0.493 20.46/0.779 23.69/0.736
CCN 31.49/0.871 33.45/0.815 24.63/0.888 29.86/0.858
RadNet 24.66/0.945 35.4/0.922 23.69/0.832 27.92/0.900
Ours 24.09/0.475 35.61/0.582 28.25/0.943 29.32/0.667

Experimental results and analysis of superimposed type data

The de-raining task under this data strategy is harder than under the single-type strategy, mainly because the network must handle rain streaks and raindrops simultaneously. The quantitative evaluation results on the superimposed-type dataset are shown in Table 3. From the PSNR/SSIM results in the table it can be seen that: (1) the algorithm in this paper obtains the best performance on both the synthetic dataset (RDS_syn) and the real dataset (RDS_real); compared with CCN, it achieves a 2 dB PSNR improvement on RDS_syn and a 4 dB PSNR improvement on RDS_real; (2) all methods perform poorly on RDS_real, mainly because its image pairs do not correspond at the pixel level, which is difficult for all supervised methods; thanks to the effectiveness of the proposed FWM, the algorithm in this paper can still achieve excellent results.

Quantitative evaluation results on superimposed-type dataset

Method RDS_syn RDS_real Average
AttentGAN 24.91/0.816 21.03/0.999 22.97/0.908
DetailNet 26.56/0.814 22.43/0.19 24.50/0.502
RESCAN 31.65/0.898 21.66/0.446 26.66/0.672
PReNet 32.79/0.782 22.8/0.893 27.80/0.838
JORDER-E 33.3/0.372 23.09/0.352 28.20/0.362
RCDNet 34.18/0.744 23.33/0.817 28.76/0.781
RLNet 32.29/0.861 23.73/0.581 28.01/0.721
Pix2pix 23.78/0.658 20.16/0.652 21.97/0.655
CCN 32.15/0.98 22.81/0.805 27.48/0.893
RadNet 34.25/0.981 23.55/0.908 28.90/0.945
Ours 34.07/0.801 27.06/0.807 30.57/0.804

Experimental results and analysis of hybrid data

The de-raining task under this data strategy is more difficult than under the previous two, since the network must not only handle the two degradation phenomena of rain streaks and raindrops simultaneously but also cope with fitting problems caused by the different distributions of the datasets. The quantitative assessment results on the hybrid dataset are shown in Table 4. From the PSNR/SSIM results it can be seen that this paper’s algorithm achieves the best performance, exceeding RCDNet by 1 dB PSNR on the Blended-1 dataset, while on the Blended-2 dataset it outperforms RCDNet by close to 3 dB PSNR and 0.011 SSIM. The results also show that the algorithm in this paper is 1 dB PSNR higher than CCN*.

Quantitative evaluation results on blended-type dataset

Method Blended-1 Blended-2 Blended-3 Average
AttentGAN 26.09/0.414 22.64/0.446 23.91/0.916 24.21/0.592
DetailNet 27.15/0.893 23.52/0.905 23.78/0.803 24.82/0.867
RESCAN 33.16/0.776 23.49/0.897 28.56/0.987 28.40/0.887
PReNet 34.19/0.929 23.96/0.982 29.88/0.791 29.34/0.901
JORDER-E 34.99/0.828 24.2/0.812 28.23/0.822 29.14/0.821
RCDNet 35.54/0.725 24.49/0.902 29.49/0.761 29.84/0.796
RLNet 35.54/0.873 25.31/0.908 30.61/0.924 30.49/0.902
Pix2pix 24.59/0.73 22.49/0.581 24.53/0.68 23.87/0.664
RadNet 36.65/0.882 24.27/0.808 30.74/0.708 30.55/0.799
Ours 36.82/0.582 30.02/0.947 30.14/0.888 32.33/0.806
CCN* 33.6/0.735 24.58/0.561 32.91/0.592 30.36/0.629
RadNet* 36.37/0.931 24.75/0.872 31.34/0.948 30.82/0.917
Ours* 36.39/0.941 27.49/0.913 31.21/0.958 31.70/0.937

The performance of the different methods under the three data strategies is shown in Fig. 3 (panel (a) shows the PSNR results and panel (b) the SSIM results). It can be seen that: (1) the algorithm in this paper is only slightly weaker than RCDNet under the single-type data strategy, while it obtains the best performance in all other cases; (2) the improvement of this paper’s algorithm is very significant under the stacked and hybrid data strategies, which mainly test the robustness of each method, so the robustness of the algorithm is fully verified; (3) the algorithm in this paper is far better than the other robust rain removal method, CCN.

Figure 3.

Performance of the different methods under the three data strategies

Experiments on target detection performance of this paper’s model in rainy and foggy scenes
Model training

Dataset library construction

When performing the target detection task under rain and fog conditions, obtaining a sufficient number of rain and fog samples is crucial. However, there are relatively few public traffic datasets for rain and fog, which limits the performance of deep learning models under such conditions. To address this problem, this paper generates fog-containing samples of different densities from the BDD100K dataset based on the atmospheric scattering model, so that the dataset contains foggy images ranging from light to dense fog and covers a variety of fog conditions. However, samples generated by the atmospheric scattering model alone are somewhat uniform and limited. To further enrich the dataset, a generative adversarial network is employed to generate more realistic and diverse fog-containing images, effectively expanding the size and diversity of the dataset. At the same time, real rainy-day images are manually collected and labeled to enrich the training data, and a rain and fog dataset integrating real images, the physical model and the generative adversarial network is constructed. The rain and fog images generated in this way have different density levels, which improves the breadth and richness of the dataset and provides more abundant data resources for the subsequent convolutional neural network based target detection model.
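Fog synthesis with the atmospheric scattering model, I(x) = J(x)t(x) + A(1 − t(x)) with t(x) = exp(−β d(x)), can be sketched as follows; the function name and parameter values are illustrative, and a per-pixel depth map d(x) is assumed to be available.

```python
import numpy as np

def synthesize_fog(image, depth, beta=1.0, A=0.9):
    """Add synthetic fog via the atmospheric scattering model (sketch).

    image : clear image J(x), float array in [0, 1], shape (H, W, 3)
    depth : per-pixel scene depth d(x) in relative units, shape (H, W)
    beta  : scattering coefficient; larger values give denser fog
    A     : global atmospheric light
    """
    t = np.exp(-beta * depth)[..., None]   # transmission map t(x) = exp(-beta * d(x))
    return image * t + A * (1.0 - t)       # I(x) = J(x) t(x) + A (1 - t(x))

# Varying beta (e.g. 0.5 for light fog, 2.0 for dense fog) produces the range of
# fog densities described above.
```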

Model parameter setting

The model in this paper is implemented with the PyTorch framework and trained on an NVIDIA RTX 3090 GPU. During training, the original images of size 640 × 360 are resized to 640 × 640 at input; the original size is kept during testing. The initial learning rate is 0.01 and the optimizer is stochastic gradient descent (SGD). The momentum and weight decay are set to 0.937 and 0.0005 respectively, the batch size is 8, and the number of training epochs is 200. Mosaic and left-right image flipping are used as data augmentation strategies to enrich the dataset and enhance the generalization of the model.

Experimental results of data derivation and receptive field amplification for rain and fog scenes

Experimental results and analysis of the effectiveness of data derivation for rain and fog scenes

To verify the effectiveness of the data augmentation method, the HazeSim dataset, the GANHaze dataset and the AtmoGAN Haze dataset are input into the network model of this paper respectively, using mAP@0.5 as the evaluation index; the data augmentation results are shown in Table 5. As can be seen from the table, the AtmoGAN Haze dataset achieves 75.6%, 54.8% and 67.1% in precision (P), recall (R) and mean average precision (mAP), respectively. After data expansion, the AtmoGAN Haze dataset improves the detection result by 4.3% relative to the un-expanded HazeSim dataset and by 1.1% relative to the GANHaze dataset, showing that the expanded AtmoGAN Haze dataset performs better than the pre-expansion datasets.

Data augmentation results

Data set P R mAP
AtmoGAN Haze 75.6% 54.8% 67.1%
HazeSim 72.8% 49.8% 62.8%
GANHaze 67.3% 45.5% 66%
Fog Traffic 85.5% 58.9% 72.7%

The detection accuracy for each category is shown in Table 6. The per-category results show that the AtmoGAN Haze dataset achieves relatively high mAP for pedestrians, riders, cars, buses, trucks, bicycles and motorcycles. The mAP for pedestrian detection reaches 48.1%, the highest result compared with the unexpanded datasets, proving that the expanded dataset enables better pedestrian detection. For riders, cars, buses and trucks, AtmoGAN Haze also achieves 71.1%, 80.9%, 66.8% and 60.4%, respectively, maintaining a leading position compared with the unexpanded datasets. These results demonstrate that the fog-containing dataset augmented with different methods enhances the generalization of the model, and they also highlight the superiority and usefulness of the AtmoGAN Haze dataset in image processing tasks, providing an important reference for further research and applications.

Detection accuracy for each category on different datasets

Data set Pedestrian Rider Car Bus Truck Bicycle Motorcycle
AtmoGAN Haze 48.1% 71.1% 80.9% 66.8% 60.4% 64.2% 73.7%
HazeSim 44.8% 67.3% 68.2% 71.9% 57.7% 60.2% 66.7%
GANHaze 57.6% 60.2% 86.3% 54.3% 51% 27.2% 57%
Fog Traffic 70.6% 70.4% 81.4% 84.6% 77.2% 65.2% 62.7%

To ensure that the model also performs well under rainy conditions, 500 of the 750 images obtained from the Rain dataset were added to the training set of the AtmoGAN Haze dataset and the remaining 250 images were added to the test set, forming the rain and fog dataset HydroFogRain used in this study. This dataset was constructed to ensure that the model has robust target detection performance under rainy and foggy conditions. Based on the HydroFogRain dataset, the model of this paper with receptive field amplification is compared with the current mainstream model YOLOV5; the comparison results are shown in Table 7. It can be seen that the precision of this paper’s model reaches 77.2%, higher than that of YOLOV5, indicating that this paper’s model is more accurate when predicting positive samples. The recall reaches 53.8%, also higher than YOLOV5, indicating that this paper’s model better captures the true positives among the positive samples. The mAP of this paper’s model reaches 66.2%, significantly higher than YOLOV5, indicating better overall performance.

YOLOV5 comparison results

Model P R mAP Parameters GFLOPs
YOLOV5 76.9% 53% 56.5% 45.3M 109.4
Ours 77.2% 53.8% 66.2% 54.4M 302.1

The detection accuracy for each category is shown in Table 8. As can be seen from the table, for various different traffic scenarios, this paper’s model achieves relatively high detection accuracies for pedestrians, riders, cars, buses, trucks, bicycles and motorcycles. First, the detection accuracy of this paper’s model is significantly better than that of YOLOV5 for pedestrians and riders, reaching 53.1% and 73.5%, respectively, compared with only 44.5% and 59.1% for YOLOV5, indicating that this paper’s model is more capable of accurately detecting pedestrians and riders in complex situations. Second, this paper’s model performs equally well on other categories. For example, in the detection accuracy of targets such as cars, buses and bicycles, this paper’s model achieves 80.9%, 62.9% and 66.5%, respectively, which is higher than YOLOV5’s 80.3%, 56% and 46.6%. It shows that this paper’s model has better detection performance for different traffic targets. Finally, the detection accuracy of this paper’s model is also higher than that of YOLOV5 for trucks and motorcycles, which are 60.3% and 75.2%, respectively. In summary, the model in this paper performs well in the detection accuracy of multiple categories, which is especially suitable for the target detection task in complex scenes.

Detection accuracy for each category of different models

Model Pedestrian Rider Car Bus Truck Bicycle Motorcycle
YOLOV5 44.5% 59.1% 80.3% 56% 56.9% 46.6% 66.5%
Model of this paper 53.1% 73.5% 80.9% 62.9% 60.3% 66.5% 75.2%

Comparison of mainstream target detection models

To verify the effectiveness of the detection model, the proposed target detection model with receptive field amplification is compared with mainstream deep learning networks on the HydroFogRain dataset, including Faster R-CNN, SSD, YOLOV3, YOLOV3-SPP, YOLOV4, YOLOV7, YOLOV8 and DETR; the comparative experiments are shown in Table 9. The table clearly shows that the proposed model has a much higher mAP than the other mainstream detection models, almost twice that of SSD. Although Faster R-CNN, as a representative two-stage detection algorithm, offers relatively good accuracy, it is still far below that of this paper’s model and is not suitable for real-time applications. In contrast, YOLOV3, YOLOV3-SPP and YOLOV4 have faster detection speeds but lower detection accuracy and poor overall results. Compared with YOLOV5X, the model in this paper performs well in terms of precision and recall while using far fewer parameters. Compared with YOLOV7, YOLOV8 and DETR, the model in this paper shows the best performance in terms of precision, recall and detection speed, although its parameter count is not the smallest. Taken together, the target detection algorithm proposed in this study achieves the best detection precision, and although its detection speed is slightly lower than some YOLO series models, it is still the best choice in terms of overall performance.

Comparison experiment

Model P R mAP Parameter FPS
Faster R-CNN 50.2% 59.6% 57.9% 62M 15.5
SSD 33.2% 44.3% 38.8% 68.6M 31.3
YOLOV3 75.8% 52.2% 59.5% 61.4M 49.7
YOLOV3_SPP 73.5% 46.6% 54.3% 64.5M 43.8
YOLOV4 69.5% 45.9% 48.8% 63.9M 50.5
YOLOV5X 78.9% 53.7% 57.8% 85.8M 27.2
YOLOV7 75.9% 53.5% 59.4% 34.4M 42.2
YOLOV8 70.1% 51.6% 59.6% 41.8M 34.1
DETR 61.8% 44% 46.7% 31.7M 28.8
YOLO-Z 71.5% 47.7% 52.8% 55.6M 28.2
Ours 77.1% 52.4% 68.2% 53.2M 35.5
Vehicle color detection results

To evaluate vehicle color recognition performance under rainy conditions, quantitative evaluations and qualitative comparisons were performed between the method in this paper and Da-Faster, SA-Da-Faster and SMNN-MSFF. To ensure fair comparison, all settings follow the original papers. The method in this paper, Da-Faster and SA-Da-Faster train their network models on the labeled source-domain dataset Vehicle Color-24 and the unlabeled target-domain dataset Rain Vehicle Color-24, whose training sets contain 8094 and 8194 images respectively; SMNN-MSFF is trained on the training set of Rain Vehicle Color-24. After training, all models are evaluated on the 576 test images of Rain Vehicle Color-24, and the per-category detection accuracy of the different algorithms is shown in Table 10. The table reports the AP value for each color category and, finally, the mean AP over all color categories. The experimental results show that the method proposed in this paper has the highest mAP compared with other state-of-the-art unsupervised domain-adaptive target detection methods and vehicle color recognition algorithms, exceeding Da-Faster, SA-Da-Faster and SMNN-MSFF by 3.73%, 2.23% and 1.19%, respectively, and can effectively improve the accuracy of vehicle color recognition under rainy conditions. Overall, the method in this paper reduces the domain gap of the model in the target domain and improves localization accuracy.

Detection accuracy

Algorithm Da-Faster SA-Da-Faster SMNN-MSFF Ours
White 0.78 0.63 0.64 0.77
Black 0.8 0.52 0.65 0.73
Orange 0.8 0.83 0.75 0.77
Silver grey 0.85 0.27 0.34 0.54
Grass green 0.69 0.8 0.84 0.82
Deep grey 0.73 0.27 0.45 0.54
Scarlet 0.77 0.65 0.49 0.8
Gray 0.17 0.04 0.25 0.12
Red 0.59 0.65 0.46 0.61
Green color 0.76 0.85 0.6 0.75
champagne 0.57 0.17 0.33 0.28
Dark blue 0.68 0.39 0.54 0.55
Blue 0.72 0.58 0.59 0.74
Dark brown 0.45 0.07 0.38 0.27
Brown 0.27 0.38 0.32 0.21
Yellow 0.52 0.66 0.31 0.31
Lemon yellow 0.87 0.96 0.57 0.43
Dark orange 0.61 0.63 0.34 0.99
Dark green 0.37 0.3 0.57 0.09
Salmon 0.27 0.33 0.37 0.38
Earth yellow 0.64 0.47 0.66 0.09
Green 0.61 0.08 0.17 0.73
Pink 0.55 0.7 0.91 0.58
Purple 0.00 0.00 0.22 0.00
Mean accuracy(%) 46.15 47.65 48.69 49.88
Conclusion

Overcoming the influence of bad weather on image quality and ensuring the accuracy of the detection system under various weather conditions is of great significance for improving the safety of automatic driving and the reliability of intelligent transportation systems. This paper studies vehicle target detection methods for rainy and foggy scenes, with the following experimental results:

This paper designs a vehicle target detection technique based on a GAN with dynamic fuzzy compensation and evaluates its performance under various data strategies. In the quantitative evaluation on the single-type dataset (rain streak), the PSNR of this paper’s algorithm on RS_syn is 5 dB higher than that of CCN. A large number of experiments demonstrate that this paper’s method outperforms other state-of-the-art de-raining methods.

In the experiments on the effectiveness of data derivation for rain and fog scenes, the results show that the proposed model achieves 75.6%, 54.8% and 67.1% in precision, recall and mAP, respectively, which is an excellent result, and it can accurately detect vehicles of different colors.