Open Access

Design of Convolutional Neural Network Optimization Algorithm Based on Embedded System and Its Application in Real-Time Image Processing

24 Mar 2025


Introduction

With the development and application of computer and artificial intelligence technology, computer vision has become an active research area. Real-time image processing, an important branch of computer vision, attracts attention in both theory and application. In recent years, driven by the rapid growth of intelligent security, autonomous driving, and defect detection, and by the corresponding development of neural network chips, the technology for deploying AI on embedded devices has also advanced rapidly [1-3]. Although image processing technology has achieved results in these fields, drawbacks remain: it requires large back-end servers for computing support, consumes a great deal of power, places high demands on the hardware resources of the device, and makes it difficult to reduce device size [4-7]. Moreover, current image processing models pursue improvements in accuracy metrics so aggressively that they accumulate a large number of redundant parameters, which makes deployment on embedded devices even more difficult [8-11].

A complete convolutional neural network (CNN) includes an input layer, hidden layers, and an output layer [12]. Each neuron in a hidden layer is connected only to a local region of neurons in the previous layer and extracts the features of that local region. Each computational layer consists of multiple feature maps, and each feature map is a plane in which all neurons share the same weights. The output layer receives the vectors from the hidden layers and can be designed to output the center coordinates, sizes, and classes of objects [13-17]. Existing studies have shown that CNN image processing is mostly trained and tested on PCs. Algorithms based on the central processing unit (CPU) can only be executed sequentially, which introduces significant delays when processing large amounts of data, and even high-performance computers cannot fully guarantee low-latency real-time image processing [18-21]. Therefore, it is important to study how to reasonably reduce the computation of image processing models and deploy them in embedded systems with limited hardware resources.

Researchers have already studied the application of CNN-based image recognition in embedded systems. Udendhran, R. et al. described application scenarios of embedded vision technology in the medical field, where deep learning algorithms are used to address the performance and energy consumption problems of embedded vision systems in image processing, with good results on task-specific datasets [22]. Wang, X. et al. emphasized that convolutional neural networks have great advantages over traditional machine learning methods in image processing. They verified the recognition performance of the Faster Region-based Convolutional Neural Network (Faster R-CNN) on the NVIDIA Jetson TK1 embedded platform and found that it not only saves time but is also robust and accurate [23]. Hedegaard, L. et al. evaluated continual 3D convolutional neural networks (Co3D CNNs), which process video frame by frame. By omitting the repeated convolution of frames they reduce computational redundancy, and by reusing existing 3D-CNN weights they achieve high real-time processing performance, reducing the floating-point operations (FLOPs) per prediction in time-sensitive scenarios while preserving memory requirements and accuracy [24].

With the continuous increase in the number of parameters and the amount of computation in convolutional neural networks, hardware acceleration has become the basic way of implementing CNN computation. The current mainstream embedded image processing platforms include application-specific integrated circuit (ASIC) platforms, field-programmable gate array (FPGA) platforms, and embedded GPU platforms. ASIC image processing platforms are integrated circuits specially designed and produced according to the requirements and characteristics of image processing algorithms. Harb, S. et al. designed a reconfigurable embedded system on an APSoC platform for real-time image steganography in the field of information security. It significantly reduces computational latency, improves throughput in real-time image steganography, and improves the efficiency of embedding and recovering secret data in digital images [25]. Kim, J. et al. used AI for large-scale content-aware coding to optimize the computational steps and bit rates of streaming, improving viewing experience and efficiency in bandwidth-constrained environments; adapting this approach to dedicated data-center video ASICs can further improve cost-effectiveness and scalability [26]. Strempfer, S. et al. proposed a lightweight, configurable digital architecture for a detector ASIC with integrated compression. It includes user-selectable lossy and lossless compression blocks and can support an increase in pixel detector frame rate into the continuous 1 MHz range [27]. These studies show that ASICs can provide excellent throughput, latency, and power consumption for the AI algorithms deployed on them.

FPGAs were developed on the basis of programmable array logic (PAL) and generic array logic (GAL) devices, and they overcome the shortcoming of ASICs, which cannot be changed once the design is completed. Pedre, S. et al. designed an embedded system with FPGA hardware acceleration and applied it to global vision algorithms for multi-robot localization. The system can work continuously, and both the image processing speed and the energy efficiency of the algorithm are significantly improved, which makes it practical for embedded real-time image processing applications [28]. Kaur, A. introduced the application of FPGAs in computer vision and image processing and developed an FPGA-based embedded architecture for augmented reality applications that delivers high-resolution, high-frame-rate real-time results and provides mobility and processing power to further enhance user immersion [29]. Hussain, S. M. et al. used FPGAs to implement an algorithm that processes CMOS image data pixel by pixel and transfers it directly to a computer for post-processing. Experiments showed that the detection rate and accuracy of the proposed algorithm improved compared with other algorithms, and that power consumption was significantly reduced [30]. Siddiqui, F. et al. examined the target recognition performance of an FPGA embedded image processing system on k-means clustering, and experiments showed that the designed 16-core FPGA soft image processing processor (IPPro) significantly improved the efficiency of k-means clustering [31]. Compared with ASICs, FPGAs are more programmable, and their flexibility allows them to be adapted to image recognition tasks in a variety of fields and to respond quickly to the needs of different applications.

Embedded GPU platforms have evolved from general-purpose GPUs to meet the embedding requirements of image processing devices. Meribout, M. et al. showed that embedded GPU platforms have become a widespread choice for real-time machine vision tasks thanks to the availability of development tools and to their cost, computing power, and power consumption, while system-on-chip (SoC) processors that combine GPU platforms with ASIC/FPGA accelerators also play an important role in embedded Internet of Things systems for machine vision [32]. Yang, T. et al. designed a parallel computing method for the chirp scaling algorithm using a parallel programming model and, on that basis, proposed a method for optimizing the memory use and performance of embedded GPU hardware architectures. Experiments showed that the distributed SAR real-time imaging method built on it achieves high performance and low power consumption [33]. Romera, T. et al. explored the optimization of embedded GPUs for iterative stencil-type image processing algorithms, using optical flow estimation with total variation (TV) regularization and the L1 norm as the algorithm for GPU architecture optimization, and found that it significantly reduces both image processing time and energy consumption [34]. With their parallel data computing capability, embedded GPU platforms stand out among embedded image processing platforms: they can run inference on multiple images at a time, which improves the real-time performance of the algorithm and effectively reduces image processing time.

This paper optimizes the CNN algorithm for an embedded system and combines it with Zynq for image processing. It first explains the basic concepts of four common CNN operators and analyzes the parallelism of convolutional computation. It then introduces the Zynq embedded system and optimizes the Im2col-Gemm algorithm of the Darknet framework to improve its speed and reduce its computing power consumption. Finally, the Zynq running the optimized CNN is tested for acceleration against other computing platforms and applied to character recognition and traffic sign detection, to verify that it can perform image recognition quickly with very little power consumption and has good prospects in real-time image processing applications.

Explanation of the CNN foundation

Before optimizing and applying algorithms, it is necessary to have a preliminary understanding of the basic concepts of CNN. In the following section, the basic meaning of CNN will be explained.

CNN Common Operators
Convolutional operators

The convolution operation is one of the most important operators in a CNN and is used in its convolutional layers. It is the same as convolution in image processing, where a convolution kernel computes a weighted sum over the image from left to right and from top to bottom. The difference is that in traditional image processing the parameters of the convolution kernel are fixed in advance, as in the Sobel operator for edge detection, which extracts the horizontal and vertical edge features of the image. In a CNN, the kernel parameters are obtained by continuous learning during training.

In a CNN, each convolution kernel represents a feature. A multi-channel convolution kernel is convolved with the corresponding channels of the input image, and the per-channel results are summed to give one output feature map; another multi-channel kernel produces another channel of the output feature map. The convolution process is shown in Figure 1. For convenience, the input and output feature maps are uniformly referred to as feature maps, or fmaps for short. The number of input fmap channels equals the number of channels of each filter, and the number of output fmap channels equals the number of filters. Thus, the number of two-dimensional convolution kernels in a layer is the product of the number of input channels and the number of output channels.

Figure 1.

Convolutional diagram

Here C denotes the number of input channels, S denotes the width of the filter, R denotes the height of the filter, W denotes the width of the input feature map, H denotes the height of the input feature map, M denotes the number of channels of the output feature map, F denotes the width of the output feature map, and E denotes the height of the output feature map. When there are multiple sets of input feature maps, N is used to indicate the batch size; the batch dimension N is not shown in Fig. 1. The expression for the convolutional layer is shown in equation (1):

$$O[m][x][y] = Activation\left( {B[m] + \sum\limits_{i = 0}^{R - 1} {\sum\limits_{j = 0}^{S - 1} {\sum\limits_{k = 0}^{C - 1} {I[k][Ux + i][Uy + j] \times W[m][k][i][j]} } } } \right)$$

where 0 ≤ m < M, 0 ≤ x < F, 0 ≤ y < E, E = (H − R + U)/U, F = (W − S + U)/U, and U is the stride. The time complexity $$C_{conv}^T$$ of the convolutional layer computation and the space complexity $$C_{conv}^S$$ of the weights are

$$C_{conv}^T = O\left( {M \cdot C \cdot R \cdot S \cdot E \cdot F} \right)$$

$$C_{conv}^S = O\left( {M \cdot C \cdot R \cdot S} \right)$$
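Equation (1) can be illustrated with a minimal direct-convolution sketch in plain Python. The function names `relu` and `conv_layer`, the nested-list data layout, and the choice to treat the first spatial index as the row (height) direction are assumptions made for this illustration and are not taken from the paper's implementation.

```python
def relu(v):
    """ReLU activation used as the default Activation in this sketch."""
    return v if v > 0.0 else 0.0


def conv_layer(I, W, B, U=1, activation=relu):
    """Direct convolution following equation (1).

    I: input fmap as nested lists [C][H][W_in]
    W: filters as nested lists [M][C][R][S]
    B: biases, one per output channel, length M
    U: stride
    """
    C, H, W_in = len(I), len(I[0]), len(I[0][0])
    M, R, S = len(W), len(W[0][0]), len(W[0][0][0])
    E = (H - R + U) // U      # output height
    F = (W_in - S + U) // U   # output width
    O = [[[0.0] * F for _ in range(E)] for _ in range(M)]
    for m in range(M):                      # output channels
        for x in range(E):                  # output rows
            for y in range(F):              # output columns
                acc = B[m]
                for k in range(C):          # input channels
                    for i in range(R):      # filter rows
                        for j in range(S):  # filter columns
                            acc += I[k][U * x + i][U * y + j] * W[m][k][i][j]
                O[m][x][y] = activation(acc)
    return O


# One 3x3 input channel, one 2x2 filter, stride 1.
example_in = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
example_w = [[[[1, 0], [0, 1]]]]
example_out = conv_layer(example_in, example_w, [0.0])
```

The six nested loops correspond directly to the factors M, E, F, C, R, and S in the time complexity, and the size of `W` corresponds to the space complexity of the weights.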

The output of the convolution is usually followed immediately by the activation function, which introduces nonlinearity. With nonlinear activation functions, a deep neural network is no longer just a linear combination of its inputs and can, in theory, approximate arbitrary functions. It is worth noting that the activation function is often represented as a separate layer of the network in software implementations; here, since the input and output of the activation function are single variables that do not need to be cached in memory, it is folded into the convolutional layer for hardware implementation.

Nowadays, most common CNN models use ReLU as the activation function. First, the form of ReLU is very simple: inputs greater than 0 are passed through unchanged, and inputs less than or equal to 0 are set to 0. Its derivative is also very simple, taking only the values 0 or 1, which is easy to implement. Second, because the derivative is exactly 1 for positive inputs, ReLU alleviates the vanishing gradient problem in CNNs and makes training easier to converge. Finally, ReLU sets the output of some neurons to 0, which makes the network sparse and helps reduce overfitting. ReLU also has a disadvantage: because its derivative is 0 over part of its domain, a large number of neurons can end up with an output of 0 during training and stop updating, which stalls the training process. Many modified versions were later proposed, such as Leaky ReLU and the Exponential Linear Unit (ELU). The idea is not to set the output to 0 on the interval x < 0 but to make it a function of x and a coefficient α (for Leaky ReLU, the product αx), with α adjusted during training.
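A minimal sketch of ReLU and Leaky ReLU with their derivatives, in plain Python. The function names and the fixed default α = 0.01 are assumptions for illustration; a variant that adjusts α during training would treat it as a learnable parameter rather than a constant.

```python
def relu(x):
    """Pass positive inputs through, clamp the rest to 0."""
    return x if x > 0 else 0.0


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs are scaled by alpha instead of zeroed."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU: never exactly 0 when alpha > 0."""
    return 1.0 if x > 0 else alpha
```

Because `leaky_relu_grad` stays nonzero for negative inputs, a neuron that falls into the negative region can still receive gradient updates.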

Pooling operator

Pooling abstracts and compresses the results of convolution while helping to prevent overfitting, which makes the network more robust. A convolutional layer in a CNN is generally followed by a pooling layer, which makes the network resilient to image translation, scaling, and slight rotation. The pooling operation is very similar to convolution: the commonly used average pooling and max pooling take the average or the maximum of the feature map values within a 2×2 or 3×3 window.

The computational time complexity of pooling is

$$C_{pool}^T = O\left( {M \cdot E \cdot F} \right)$$

where M denotes the number of channels of the output feature map, F denotes the width of the output feature map, and E denotes the height of the output feature map.
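The pooling operation on a single channel can be sketched as follows. The function name `pool2d`, its parameters, and the nested-list layout are assumptions for illustration; applying it to each of the M channels gives the M · E · F output elements counted in the complexity above.

```python
def pool2d(fmap, size=2, stride=2, mode="max"):
    """Max or average pooling over one channel given as nested lists [H][W]."""
    H, W = len(fmap), len(fmap[0])
    E = (H - size) // stride + 1   # output height
    F = (W - size) // stride + 1   # output width
    out = []
    for x in range(E):
        row = []
        for y in range(F):
            # Collect the values inside the current pooling window.
            window = [fmap[stride * x + i][stride * y + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out


channel = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
max_pooled = pool2d(channel, mode="max")
avg_pooled = pool2d(channel, mode="avg")
```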

Fully connected operators

The fully connected layer, also referred to as the FC layer, acts as a “classifier” in a convolutional neural network. By learning weights, it can represent a nonlinear combination of the features learned by previous layers. It processes the input and outputs an N-dimensional vector, where N corresponds to the number of classes. Since it is fully connected, every neuron in the previous layer is connected to every neuron in this layer. Assuming that the inputs and outputs are vectors, the fully connected layer operation can be represented by equation (5), where all the inputs from the previous layer are represented by the vector x, $$y_o$$ is an element of the output vector y, $$w_o$$ is the corresponding vector in the weight matrix $$w_o^i$$, the bias $$b_o$$ is the corresponding element of the bias vector b, and Activation denotes the activation function.

$${y_o} = Activation\left( {x \cdot {w_o} + {b_o}} \right)$$

The fully connected layer usually comes immediately after a pooling layer. In the case of multiple input channels, the fully connected layer can be viewed as a convolution operation whose kernel size equals the size of the input fmap. The multidimensional input data can then also be viewed as a one-dimensional vector and computed according to equation (5). The fully connected layer is both computationally intensive and parameter intensive, and it has far more parameters than a convolutional layer with the same number of input and output neurons. This is because the parameters of a convolutional layer are shared within a single input channel, whereas in a fully connected layer each input neuron has its own parameters. The computational time complexity $$C_{fc}^T$$ and the space complexity $$C_{fc}^S$$ of the weights for the fully connected layer are, respectively,

$$C_{fc}^T = O(M \cdot C)$$

$$C_{fc}^S = O(M \cdot C)$$

where M is the length of the output vector and C is the length of the input vector.
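The fully connected computation and the parameter comparison with a convolutional layer can be sketched as follows. The function names, the default ReLU activation, and the example sizes (a 16-channel 8×8 fmap and a 3×3 convolution assumed to preserve the spatial size) are illustrative assumptions, not values from the paper.

```python
def fully_connected(x, weights, bias, activation=lambda v: max(v, 0.0)):
    """Compute y_o = Activation(x . w_o + b_o) for every output neuron o.

    x: input vector of length C
    weights: M weight vectors, each of length C
    bias: M bias values
    """
    return [activation(sum(xi * wi for xi, wi in zip(x, w_o)) + b_o)
            for w_o, b_o in zip(weights, bias)]


def fc_weight_count(C, M):
    """Number of weights of an FC layer with C inputs and M outputs."""
    return M * C


def conv_weight_count(C, M, R, S):
    """Number of weights of a conv layer with C input and M output channels."""
    return M * C * R * S


# Same neuron counts on both sides: 16 channels of 8x8 = 1024 neurons in and out.
fc_params = fc_weight_count(16 * 8 * 8, 16 * 8 * 8)
conv_params = conv_weight_count(16, 16, 3, 3)
```

Under these assumed sizes the fully connected layer needs several hundred times more weights than the convolutional layer, which reflects the weight sharing described in the text.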

Softmax operator

The softmax layer is used in the final classification process to map the outputs of multiple neurons to the interval $$\left( {0,1} \right)$$, mainly through the normalized exponential operation, and the result after normalization is used as the final classification probability output. Given a vector v whose ith element is $$v_i$$, the softmax value of this element is

$${S_i} = \frac{{{e^{{v_i}}}}}{{\sum\limits_j {{e^{{v_j}}}} }}$$

The softmax layer thus assigns a larger output probability to larger inputs and a smaller output probability to smaller inputs, with all outputs normalized so that they sum to 1.
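A short sketch of the softmax computation in plain Python. Subtracting the maximum before exponentiation is a common numerical-stability step that is not part of the formula in the text; it leaves the result unchanged because the common factor cancels in the ratio. The function name is an assumption for illustration.

```python
import math


def softmax(v):
    """Normalized exponential of a list of scores."""
    m = max(v)                            # shift for numerical stability (assumption)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```

The outputs lie in (0, 1), sum to 1, and preserve the ordering of the inputs, so the largest score receives the largest probability.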

Parallelism analysis of convolutional computation

Hardware acceleration of convolutional neural networks requires a thorough study of the parallelism of the CNN architecture; the mathematical model of the convolutional layer was given in the preceding section. The focus of hardware acceleration is to make full use of the hardware logic of the FPGA and combine it with the intrinsic parallelism of the CNN, balancing hardware resources while increasing the speed of computation as well as the bandwidth and throughput of data transmission. Intra-layer and inter-layer parallelism are the main parallel mechanisms in CNN forward propagation: the former speeds up the computation of specific modules, and the latter reduces the delay between different modules of the network. The following is a brief description of both.

Neural network intra-layer parallelism analysis

In CNN forward propagation, the convolutional module is highly similar to the pooling module in that both exhibit pronounced sliding-window parallelism, so an analysis of the convolutional layer can be generalized to other layers. The large number of multiply-accumulate operations in the convolutional layers consumes the major portion of the computational resources of the entire system, so exploiting the parallelism of the convolutional layer is the main part of parallelizing forward propagation as a whole. The essential operation of convolution is multiply-accumulate; its principle is shown in Fig. 2, and it has the following four main parallel characteristics.

Figure 2.

Multiply-accumulate operation diagram

Parallelism within the convolutional window

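The multiplications between the input feature map values inside one window and the corresponding weights of the same convolution kernel are independent of one another and can be performed in parallel.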

Parallelism on the same input feature map

The operations of the same convolution kernel at different locations on the same input feature map are independent of one another and can be executed in parallel.

Parallelism on different input feature maps

The N input feature maps of the preceding stage are combined with the corresponding group of N convolution kernels to produce one output feature map. The different input channels are not correlated with one another and can be processed at the same time: each input channel is multiplied element-wise with its matching kernel, and the results are accumulated to give the output feature.

Parallelism of different output feature maps

Output features at the same position on different output feature map channels depend on the same convolution or pooling window of the previous layer, so the neurons at the same position on different output channels can also be computed in parallel.
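The four kinds of intra-layer parallelism listed above can be related to the loops of a direct convolution, as in the following sketch. The function names and the use of a thread pool are assumptions for illustration only: the thread pool merely demonstrates that output channels can be computed independently, and it is not meant to model FPGA hardware or to give a speedup in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def conv_one_output_channel(I, W_m, U=1):
    """Compute one output channel; I is [C][H][W], W_m is [C][R][S]."""
    C, H, Wd = len(I), len(I[0]), len(I[0][0])
    R, S = len(W_m[0]), len(W_m[0][0])
    E, F = (H - R + U) // U, (Wd - S + U) // U
    out = [[0.0] * F for _ in range(E)]
    for x in range(E):              # window positions on the same input map
        for y in range(F):          # are independent of one another
            acc = 0.0
            for k in range(C):      # input channels: independent products, then summed
                for i in range(R):  # products inside one window are independent
                    for j in range(S):
                        acc += I[k][U * x + i][U * y + j] * W_m[k][i][j]
            out[x][y] = acc
    return out


def conv_parallel_over_outputs(I, W, U=1, workers=4):
    """Different output channels share no results, so they can run concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda W_m: conv_one_output_channel(I, W_m, U), W))


fmap = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]          # C = 2, 2x2
filters = [[[[1]], [[1]]], [[[2]], [[0]]]]            # M = 2, 1x1 kernels
sequential = [conv_one_output_channel(fmap, W_m) for W_m in filters]
concurrent = conv_parallel_over_outputs(fmap, filters)
```

The concurrent result matches the sequential one, which is the property a hardware design exploits when it replicates compute units across output channels.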

Neural network inter-layer parallelism analysis

There are dependencies between CNN layers: the output features extracted by one layer are usually the input features of the next. This dependency makes parallelization more difficult, so inter-layer parallelism differs from intra-layer parallelism. Inter-layer parallelism focuses on the overall performance of the system: it allows functions or tasks to execute in an overlapping manner, which improves the throughput of the overall design, reduces the interval between tasks, and improves computing performance across the whole process.

CNN optimization based on embedded systems

With the basic concepts of CNN in place, the following section describes the embedded platform Zynq and optimizes the core computation of CNN, the convolution, based on the Im2col-Gemm algorithm of the Darknet framework, in order to improve the computing speed of the CNN algorithm and reduce its power consumption.

Zynq-based development flow
FPGA Basics

An FPGA (field-programmable gate array) is a programmable logic device that the user can program and configure to perform various logic functions. Because of its high flexibility, reconfigurability, and ability to perform high-speed computation, it is widely used in fields such as digital signal processing, communications, image processing, and control systems. The basics of FPGAs are as follows.

Structure of FPGA: An FPGA consists of programmable logic units, input/output blocks, clock management units, digital signal processing (DSP) multipliers, etc. These units are connected through programmable interconnect resources.

Programming of FPGA: FPGAs are programmed in two ways, one using hardware description languages such as Verilog and VHDL, and the other using high-level synthesis (HLS) tools such as Vivado HLS.

Timing analysis of FPGA: Because the logic units and interconnect resources in an FPGA are programmable, timing analysis is a very important part of FPGA design; it requires setting timing constraints and verifying that they are met.

FPGA design flow: The design flow of an FPGA usually includes design, synthesis, implementation, and verification, of which synthesis and implementation are the core steps.

In summary, FPGA is a programmable logic device, which is one of the important tools in the design of digital circuits, and is now widely used in many fields.

Zynq Architecture

Zynq UltraScale+ is a highly integrated programmable SoC (system-on-chip) from Xilinx that combines Xilinx's UltraScale+ FPGA (field-programmable gate array) fabric with ARM Cortex-A53 and Cortex-R5 multicore processors. It is built on the TSMC 16 nm FinFET+ process and offers excellent performance and low power consumption. It supports many I/O standards, including PCI Express Gen3, USB 3.0, Gigabit Ethernet, and DisplayPort. It also integrates high-speed memory controllers, video codecs, digital signal processors, and a variety of peripheral interfaces, providing users with a wealth of hardware resources and interfaces. The Zynq UltraScale+ MPSoC is also equipped with programmable logic and a hardware debugger that allow users to design, simulate, and debug using the Vivado development suite. In addition, Xilinx provides development tools and software libraries to help users quickly develop efficient and reliable systems. Boards currently available with this type of chip include the Ultra96-V2, ZCU102, ZCU104, and ZCU106, which are mainly used in machine vision, surveillance, autonomous driving, and similar scenarios. Figure 3 shows the architecture of the ZCU104. The architecture diagram shows a quad-core ARM Cortex-A53 and a dual-core ARM Cortex-R5, as well as a DDR4 memory controller, on the ZCU104 development board. In addition, the development board includes an FPGA programmable logic area and multiple high-speed serial connectivity ports, including PCI Express and DisplayPort. MIO and HP are input/output ports in the processor subsystem that can be used to communicate with external devices.

Figure 3.

Architecture diagram of ZCU104

Overall, the Zynq UltraScale+ fabric integrates high-performance FPGAs and multi-core processors, providing powerful computing and communication capabilities for a variety of application scenarios, making it an excellent programmable SoC chip.

CNN Optimization Design Based on Darknet Framework
Introduction to the Darknet Framework

To reduce the difficulty of implementing CNN algorithms, deep learning frameworks are used to deploy the models. Deep learning frameworks encapsulate the regular operations of the algorithm into operators that can be called directly and abstract away the hardware architecture, reducing the complexity of the algorithm and the difficulty of development. The Darknet framework is a lightweight, open-source neural network framework written in C that supports both CPU and GPU computation; the structure of the network can be customized freely by editing a configuration file. Darknet is easy to install, portable, and free of external dependencies, which makes it very suitable for deploying convolutional neural networks on embedded platforms. The block diagram of the Darknet architecture is shown in Figure 4.

Figure 4.

Darknet framework block diagram

Design of Im2col-Gemm Algorithm Based on Darknet Framework

In image processing, Im2col is a transformation that rearranges image data into matrix columns; it is one of the earliest important ways of performing convolution in convolutional neural networks and was used in the Caffe framework. In the Darknet framework, Im2col is likewise used to convert the 3D tensor into a 2D matrix, so that general matrix multiplication (Gemm) can be used to perform the convolution for different CNN models; the resulting 2D matrix is finally converted back to a 3D output through Col2im. Compared with traditional direct convolution, the Im2col+Gemm approach is widely used in current convolutional neural networks.

When processing an image, the number of channels also needs to be considered. Multi-channel convolution is the convolution of a multi-channel input feature map with multiple convolution kernels in three-dimensional space. The three-dimensional convolution can be expressed as a superposition of two-dimensional convolutions, calculated as shown in Equation (9), where Z is the convolution output, I is the convolution input, K is the convolution kernel, c is the input channel index, and h and w are the row and column offsets within the convolution kernel.

$$Z(i,j) = {I_{(i,j)}}*{K_{(i,j)}} = \sum\limits_c {\sum\limits_h {\sum\limits_w {{I_c}\left( {i - h,j - w} \right){K_c}\left( {h,w} \right)} } }$$

Im2col transforms the input feature map into a matrix for convolutional computation at the cost of extra space, i.e., it converts a three-dimensional tensor into a two-dimensional matrix. The idea is to arrange all the data needed for each sliding-window position into a column vector and then splice these vectors into a matrix along the column direction, following the order of the channels.

It can be seen that Im2col reduces the number of nested loops by lowering the data dimensions, which shortens the computation time. In addition, once the three-dimensional tensor has been transformed into a two-dimensional matrix, the data required for the operation are stored contiguously in memory, which greatly improves access speed. However, the number of data elements after the transformation is greater than before, because overlapping windows duplicate input values, so memory usage increases. Im2col is therefore an algorithm that trades space for time.

After the 3D tensor has been converted into a 2D matrix by the Im2col algorithm, the Gemm computational library is called to compute the convolution of the input matrix with the convolution kernel. Gemm is an important way of performing matrix computation in deep learning. More than 99% of the computations in convolutional neural networks are in the convolutional and fully connected layers, and the matrix computations in these two layers are basically realized by Gemm.
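The Im2col+Gemm procedure described above can be sketched in plain Python as follows. The function names, the nested-list layout, and the absence of padding are assumptions for illustration; the row ordering (channel, then kernel row, then kernel column) is one common choice and is not claimed to reproduce the Darknet source exactly.

```python
def im2col(I, R, S, U=1):
    """Rearrange a [C][H][W] tensor into a (C*R*S) x (E*F) matrix.

    Each column holds the values covered by one sliding-window position.
    """
    C, H, Wd = len(I), len(I[0]), len(I[0][0])
    E, F = (H - R + U) // U, (Wd - S + U) // U
    cols = []
    for k in range(C):
        for i in range(R):
            for j in range(S):
                cols.append([I[k][U * x + i][U * y + j]
                             for x in range(E) for y in range(F)])
    return cols, E, F


def gemm(A, B):
    """Plain matrix product of A (n x p) and B (p x q)."""
    n, p, q = len(A), len(B), len(B[0])
    return [[sum(A[r][t] * B[t][c] for t in range(p)) for c in range(q)]
            for r in range(n)]


def conv_im2col(I, W, U=1):
    """Convolution of I [C][H][W] with filters W [M][C][R][S] via im2col + gemm."""
    R, S = len(W[0][0]), len(W[0][0][0])
    cols, E, F = im2col(I, R, S, U)
    # Flatten each filter into one row, in the same order as the im2col rows.
    W_mat = [[W_m[k][i][j] for k in range(len(W_m))
              for i in range(R) for j in range(S)] for W_m in W]
    flat = gemm(W_mat, cols)                      # M x (E*F)
    # Reshape each output row back into an E x F feature map.
    return [[row[x * F:(x + 1) * F] for x in range(E)] for row in flat]


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
kernel = [[[[1, 0], [0, 1]]]]
columns, _, _ = im2col(image, 2, 2)
result = conv_im2col(image, kernel)
```

In this small example the 9 input values expand to a 4 × 4 matrix of 16 values, which illustrates the extra memory that Im2col spends in exchange for a single matrix multiplication.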

The standard matrix multiplication algorithm has a time complexity of $$O\left( {{n^3}} \right)$$. Its multiply-accumulate operations occupy a large amount of computational resources, and the resources of an embedded platform are limited, so Gemm needs to be optimized. Two approaches can be considered. The first is algorithmic: based on the computational characteristics of matrix multiplication, the computation is optimized from a mathematical point of view, as in the Winograd algorithm. The second is software-based: the computation order is adjusted to suit the hierarchical structure of the computer's memory system, mainly through loop splitting, vectorization, memory rearrangement, and so on.
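As an illustration of the software-based approach, the following sketch compares a naive triple-loop matrix multiplication with a tiled (blocked) version that splits each loop so that small sub-blocks are reused while they are still in fast memory. The function names and the tile size are assumptions for illustration; in plain Python the tiling changes only the traversal order and is not expected to show the cache benefit that it gives in compiled code.

```python
def gemm_naive(A, B):
    """Reference O(n^3) matrix product of A (n x p) and B (p x q)."""
    n, p, q = len(A), len(B), len(B[0])
    C = [[0.0] * q for _ in range(n)]
    for i in range(n):
        for j in range(q):
            for k in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C


def gemm_tiled(A, B, tile=2):
    """Same product, with every loop split into blocks of size `tile`."""
    n, p, q = len(A), len(B), len(B[0])
    C = [[0.0] * q for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, p, tile):
            for j0 in range(0, q, tile):
                # Work on one tile; the innermost loop walks along rows of B
                # and C, so memory is accessed contiguously.
                for i in range(i0, min(i0 + tile, n)):
                    row_c = C[i]
                    for k in range(k0, min(k0 + tile, p)):
                        a = A[i][k]
                        row_b = B[k]
                        for j in range(j0, min(j0 + tile, q)):
                            row_c[j] += a * row_b[j]
    return C


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
B = [[1, 0], [0, 1], [2, 0], [0, 2]]
reference = gemm_naive(A, B)
blocked = gemm_tiled(A, B, tile=2)
```

Both versions perform the same multiply-accumulate operations, so the asymptotic complexity is unchanged; only the order of memory accesses differs.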

Optimized CNN testing and applications

Having introduced the basic concepts of CNN and the ideas behind the algorithm optimization, this paper now runs acceleration tests that compare the Zynq running the optimized CNN with CPU and GPU, to show that it effectively improves computing speed and reduces computing power consumption. The advantages of Zynq in real-time image processing are further validated by character recognition and traffic sign detection. The following section describes the tests and applications.

Optimization Acceleration Test
Acceleration ratio analysis

Three hardware configurations are used for experimental comparison. The first uses only a single Cortex-A9 core on the PS side of the Zynq chip to run the CNN, and the PL side of the Zynq is not involved in the computation. The second uses an Intel i7-9750H CPU to implement the CNN. The third uses the Zynq-7035 chip with the optimized CNN, in which software and hardware are co-designed across the PS side and the PL side to accelerate the computation of the CNN. The CNNs built under the different hardware configurations are compared, the average time spent on each layer and the total time spent on the CNN are recorded, and the test results are shown in Table 1.

Table 1. Comparison of CNN network operation time on different hardware

| Operation time/s | Cortex-A9 single-core | Intel CPU | Zynq-7035 |
| --- | --- | --- | --- |
| Conv1+Pool1 | 20.5745 | 0.9790 | 0.1619 |
| Conv2+Pool2 | 161.4176 | 5.4650 | 0.2265 |
| Conv3 | 160.3363 | 5.4470 | 0.2023 |
| Conv4+Pool3 | 320.6059 | 10.9780 | 0.3118 |
| Conv5 | 159.0134 | 5.5630 | 0.1875 |
| Conv6+Pool4 | 318.0369 | 11.0760 | 0.3334 |
| Conv7 | 79.7953 | 2.7340 | 0.1552 |
| Conv8+Pool5 | 79.7529 | 2.7350 | 0.1497 |
| FC1 | 4.5438 | 0.0530 | 0.2569 |
| FC2 | 0.0261 | 0.0000 | 0.0021 |
| FC3 | 0.0002 | 0.0000 | 0.0000 |
| Total time | 1304.1029 | 45.03 | 1.9858 |
| Total duration ratio | ×658.12 | ×23.18 | ×1.0 |
| Convolution layer duration ratio | 99.72% | 99.69% | 87.15% |

With the optimized CNN, the PS and PL sides of the Zynq-7035 chip accelerate the network through loop tiling, loop pipelining, loop unrolling, and cache optimization, with the computation of the convolutional, pooling, and fully connected layers implemented on the PL side. This gives a high degree of parallelism and raises the computation rate of the CNN, even though the operating frequency is only 100 MHz. By contrast, the CNN achieves little parallelism on the Cortex-A9 single core and the Intel CPU, and the Cortex-A9 single core also runs at a low operating frequency, so neither is well suited to highly parallel operations.
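
To make the idea of loop unrolling concrete, the sketch below unrolls a multiply-accumulate loop by a factor of 4. It is only an illustration under stated assumptions: the function names and the unroll factor are hypothetical, integer operands are assumed so that the order of summation does not affect the result, and Python itself gains no speed from the transformation. On programmable logic, the four independent accumulators would correspond to four multiply-accumulate units working in parallel.

```python
def mac_rolled(weights, inputs):
    """Plain multiply-accumulate loop: one product per iteration."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc += w * x
    return acc


def mac_unrolled4(weights, inputs):
    """The same sum with the loop body unrolled by a factor of 4.

    Each of the four accumulators depends only on its own previous value,
    so the four products in one iteration are independent of each other.
    """
    n = len(weights)
    acc0 = acc1 = acc2 = acc3 = 0
    main = n - n % 4                      # largest multiple of 4 not above n
    for i in range(0, main, 4):
        acc0 += weights[i] * inputs[i]
        acc1 += weights[i + 1] * inputs[i + 1]
        acc2 += weights[i + 2] * inputs[i + 2]
        acc3 += weights[i + 3] * inputs[i + 3]
    tail = 0
    for i in range(main, n):              # leftover elements when n % 4 != 0
        tail += weights[i] * inputs[i]
    return acc0 + acc1 + acc2 + acc3 + tail
```

Loop pipelining is complementary: it overlaps successive iterations in time, whereas unrolling replicates the loop body in space.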

From Table 1, it can be seen that, compared with realizing the whole CNN on the Cortex-A9 single core and on the Intel CPU, the PL-side accelerated CNN achieves speedups of 658.12 and 23.18 times, respectively. The convolutional layers occupy most of the computing time of the CNN, over 99% of the total on the two CPU platforms and 87.15% on the Zynq-7035, because their computation is significantly larger than that of the other layers. Although the fully connected layers have more parameters, their computation is simple and fast.
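
The ratios in Table 1 follow from simple divisions, which the short sketch below reproduces from the tabulated times. The dictionary and function names are illustrative assumptions, and the recomputed values may differ slightly from the printed ratios.

```python
# Total times in seconds, copied from the "Total time" row of Table 1.
total_time = {
    "Cortex-A9 single-core": 1304.1029,
    "Intel CPU": 45.03,
    "Zynq-7035": 1.9858,
}

# Per-layer times in seconds for the eight convolution (+pooling) rows of Table 1.
conv_time = {
    "Cortex-A9 single-core": [20.5745, 161.4176, 160.3363, 320.6059,
                              159.0134, 318.0369, 79.7953, 79.7529],
    "Intel CPU": [0.9790, 5.4650, 5.4470, 10.9780,
                  5.5630, 11.0760, 2.7340, 2.7350],
    "Zynq-7035": [0.1619, 0.2265, 0.2023, 0.3118,
                  0.1875, 0.3334, 0.1552, 0.1497],
}


def speedup(baseline, accelerated="Zynq-7035"):
    """Ratio of the baseline total time to the accelerated total time."""
    return total_time[baseline] / total_time[accelerated]


def conv_share(platform):
    """Fraction of the total time spent in convolution (+pooling) layers."""
    return sum(conv_time[platform]) / total_time[platform]


for name in total_time:
    print(f"{name}: speedup x{speedup(name):.2f}, conv share {conv_share(name):.2%}")
```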

Among the convolutional layers, the first takes relatively little time: although its input and output feature maps are large, it has few input and output channels, so its computation completes quickly. The last convolutional layer, conversely, has many input and output channels but small input and output feature maps, so its computation is also modest and takes little time. The 4th and 6th convolutional layers, in the middle of the network, have neither the largest number of channels nor the largest feature maps, but their computation is heavier overall and, with the pooling layer taken into account, they take the most time. Among the fully connected layers, the first takes the longest to compute, and the time for the latter two decreases rapidly because the number of input and output channels drops sharply.

Analysis of resource utilization

When the Zynq-7035 chip is used for CNN acceleration, resources on both the PS side and the PL side of the chip are used. After the hardware circuit is generated on the PL side of the Zynq, the occupancy of the various FPGA resources is as shown in Figure 5. In optimizing the algorithm and then designing the CNN accelerator, this paper loads as much as possible of the input features, weights and biases stored in the external DDR3 memory onto the PL side of the Zynq, to reduce data transfer between the PL side and the external memory. On the FPGA, such storage is commonly implemented with FIFO or RAM IP cores, which occupy BRAM resources.

Figure 5.

Diagram of different resources used by CNN acceleration

From Fig. 5, it can be seen that CNN acceleration uses BRAM most heavily, at 90% of the total, which is consistent with the design requirements. DSP resources are also heavily used, mainly to implement the multiply-add operations of the convolutional and fully connected layers, while the other resources are used less.

Real-time image processing applications
Character Recognition Detection

Zynq combined with the optimized CNN is applied to person recognition detection, and hardware-level acceleration of target detection is carried out on images from different scenarios, such as multi-person, small-target, and mask recognition scenarios. The experiments are summarized in Table 2:

Table 2. Comparison of object detection hardware-level acceleration experiments

| Item | CPU | GPU | Zynq |
| --- | --- | --- | --- |
| Experimental platform | Intel Core i5-10400F | NVIDIA GTX 3060 | ZU5EV |
| Development language | C | C | Verilog HDL |
| mAP | 71% | 71% | 69% |
| Data precision | Float32 | Float32 | INT8 |
| FPS | 30 | 355 | 220 |
| Power consumption (W) | 70 | 185 | 4.5 |
| Throughput (GOP) | 0.94 | 202 | 145 |
| Energy efficiency ratio (GOP/W) | 0.15 | 1.13 | 31.4 |
| Clock frequency | 2.8 GHz | 1321 MHz | 201 MHz |

To verify the performance of Zynq combined with the optimized CNN, the same images were tested on CPU, GPU and FPGA platforms, where the CPU and GPU platforms were tested on the COCO dataset under the PyTorch framework with Float32 arithmetic throughout. From Table 2, it can be seen that, with only the data quantization changed, the mAP is not greatly reduced (69% against 71%). In terms of real-time performance, Zynq combined with the optimized CNN reaches 220 FPS, taking only about 4.5 ms to recognize an image. Compared with the 355 FPS achieved by the GPU, this design reaches 220 FPS at a power consumption of about 4.5 W, far below the GPU's 185 W, while the CPU, owing to its serial processing nature, achieves only 30 FPS and consumes 70 W. Since the performance gap between the ARM side of the Zynq and the CPU is too large, the ARM side is not included in the discussion. Combined with the optimized CNN, Zynq achieves somewhat lower acceleration than the GPU and much higher acceleration than the CPU, at about 2.5% of the GPU's power consumption and 6.4% of the CPU's. Overall, acceleration of CNNs on the Zynq development board through hardware and software collaboration is realized.
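
The per-image latency and the power ratios quoted above can be checked directly from the FPS and power rows of Table 2. The following sketch does so; the dictionary layout and function names are illustrative assumptions rather than part of the experimental code.

```python
# FPS and power consumption (W), copied from Table 2.
platforms = {
    "CPU": {"fps": 30, "power_w": 70.0},
    "GPU": {"fps": 355, "power_w": 185.0},
    "Zynq": {"fps": 220, "power_w": 4.5},
}


def frame_time_ms(name):
    """Average time per image in milliseconds, i.e. 1000 / FPS."""
    return 1000.0 / platforms[name]["fps"]


def power_fraction(name, reference):
    """Power consumption of one platform as a fraction of another's."""
    return platforms[name]["power_w"] / platforms[reference]["power_w"]


print(f"Zynq time per image: {frame_time_ms('Zynq'):.2f} ms")
print(f"Zynq power relative to GPU: {power_fraction('Zynq', 'GPU'):.1%}")
print(f"Zynq power relative to CPU: {power_fraction('Zynq', 'CPU'):.1%}")
```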

Implementing the optimized CNN on Zynq consumes a large amount of FPGA logic resources. The hardware resource consumption report was generated in the Vivado development tool, and the resource consumption statistics are shown in Table 3. As Table 3 shows, a total of 336 DSPs and 93.6 BRAMs are consumed, together with 54,891 LUTs and 66,130 FFs.

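Table 3. Logical resource consumption statistics

| Module | LUT | LUTRAM | DSP | BRAM36K | FF |
| --- | --- | --- | --- | --- | --- |
| Convolution accelerator | 28.9K | 14.7K | 331 | 43 | 17.6K |
| ARM soft core | 14.8K | 0 | 5 | 19 | 3.2K |
| AXI DMA | 28.1K | 2.21K | 0 | 31.6 | 6.7K |
| Total | 71.8K | 16.91K | 336 | 93.6 | 27.5K |

Traffic sign detection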

To evaluate the processing capability of Zynq combined with the optimized CNN, performance tests are conducted in this paper. Since the model is trained on the GTSDB dataset, the previously separated test set, 100 images in total, is used as the test data, and the number of traffic signs it contains is counted. In addition, to make the test results more reliable, the dataset is randomly split 5 times, a model is trained on each split, and the test set of each split is used as the algorithm test data.

Table 4 was obtained from these performance tests. In detecting traffic signs, the algorithm misses a certain number of signs, but the missed detection rate stays at a low level, 8.28% on average. The main cause is that the candidate-box extraction algorithm is affected by certain complex environments, such as extreme illumination and similarity between foreground and background, with limitations in model training as a further factor. In recognizing traffic signs, the average recognition rate of 97.8% shows good recognition performance, reflecting the great advantage of CNNs in solving classification problems. As for the time taken per traffic sign image, although the per-image time measured in the tests does not reach the industrial level of real-time performance, it is enough to confirm the feasibility of applying Zynq combined with the optimized CNN to real-time traffic sign detection and recognition.

Table 4. CNN performance test results

| Test number | 1 | 2 | 3 | 4 | 5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Sample size | 100 | 100 | 100 | 100 | 100 | |
| True number | 130 | 159 | 142 | 129 | 155 | |
| Number of omissions | 6 | 17 | 9 | 10 | 19 | |
| Missed detection rate | 4.6% | 10.7% | 6.3% | 7.6% | 12.2% | 8.28% |
| Identification number | 128 | 154 | 138 | 128 | 151 | |
| Recognition rate | 98.4% | 96.8% | 97.2% | 99.2% | 97.4% | 97.8% |
| Time per image | 1.23s | 1.46s | 0.92s | 1.12s | 1.24s | 1.19s |
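
The averages in Table 4 are plain arithmetic means of the five per-test values. A minimal sketch of that calculation follows, with variable and function names chosen here for illustration.

```python
from statistics import mean

# Per-test values copied from Table 4 (tests 1 to 5).
missed_detection_rate_pct = [4.6, 10.7, 6.3, 7.6, 12.2]
recognition_rate_pct = [98.4, 96.8, 97.2, 99.2, 97.4]
time_per_image_s = [1.23, 1.46, 0.92, 1.12, 1.24]


def average(values):
    """Arithmetic mean over the five random splits."""
    return mean(values)


print(f"Average missed detection rate: {average(missed_detection_rate_pct):.2f}%")
print(f"Average recognition rate: {average(recognition_rate_pct):.1f}%")
print(f"Average time per image: {average(time_per_image_s):.3f} s")
```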

Taken together, the above test results and analysis show that Zynq combined with the optimized CNN has good detection and recognition ability, which basically meets the requirements of an initial application. Through subsequent research and improvement, its detection and recognition level can be raised further so as to meet the requirements of true real-time image processing.

Conclusion

This paper studies the optimization and application of CNNs on embedded systems. The Im2col-Gemm algorithm based on the Darknet framework is optimized to improve the computing speed of the CNN and reduce its power consumption. When the optimized CNN is deployed on different hardware, Zynq achieves speedups of 658.12 and 23.18 times over the Cortex-A9 single core and the Intel CPU, respectively. Zynq combined with the optimized CNN is then applied to character recognition detection and traffic sign detection. The data show that the optimized character recognition reaches 220 FPS, taking only about 4.5 ms to recognize an image, and that the traffic sign recognition rate is as high as 97.8% while the missed detection rate is as low as 8.28%.

In summary, the research in this paper shows that the optimized CNN algorithm based on an embedded system can effectively speed up convolution, reduce its power consumption, and raise the real-time image recognition rate while lowering the missed detection rate, which gives it broad application prospects in the field of AI.