Disclosure of Invention
Problem solved by the invention: to overcome the defects of the prior art, the invention provides a dynamic stepless despinning system and method based on high-level synthesis for large-scale integrated circuits. Using high-level synthesis together with FPGA parallel acceleration and pipeline optimization, it achieves stepless despinning with high precision, a large range, high real-time performance and high output image quality. The precision reaches 0.001°, so despinning can be applied to extremely small angles; the despinning range is 0° to 360°, so any angle can be handled; one frame is processed in less than 12 ms, so despinning runs in real time; and bilinear interpolation is used, so the output image is smooth and free of jagged edges. In the prior art, real-time performance, precision, range and image quality conflict with one another, so existing methods achieve only one or some of these indexes and cannot meet all of them at the same time. The invention therefore has high engineering application value.
The technical solution of the invention is as follows: a dynamic stepless despinning system designed with high-level synthesis for large-scale integrated circuits. Taken as a whole, the core innovations of the invention are: 1) high-level synthesis is used, that is, FPGA algorithm design optimization and resource scheduling are carried out in high-level languages such as C++; 2) the algorithm is accelerated by pipeline optimization, which raises data throughput, greatly reduces latency and improves the real-time performance of image despinning; 3) several high-bandwidth AXI buses are used in parallel in real time, which improves data read/write efficiency and the real-time performance of the algorithm; 4) a four-pixel merging module (the four-in-one module) is designed, in which the four 8-bit pixels needed for bilinear interpolation are merged into one 32-bit word, so that all four pixels can later be read in a single access, greatly reducing the latency caused by repeated reads.
The system comprises a video acquisition module, a video decoding module, a core processing module and a video encoding module. The core processing module is a heterogeneous system-on-chip with an FPGA + ARM architecture, specifically a Zynq UltraScale+ MPSoC 15EG chip. The FPGA comprises a dynamic stepless despinning module, a video-to-AXI-stream module, an AXI video stream DDR read/write module, and a pixel merging module (the four-in-one module) newly designed to reduce algorithm latency and improve bus bandwidth utilization. The ARM side comprises a video storage module (DDR) and an RS422 serial communication module, and the FPGA and the ARM communicate over an AXI control bus.
The video acquisition module uses a camera to acquire the original video image, which is the data to be despun; the acquired image is passed to the video decoding module.
The video decoding module converts the serial video acquired by the camera into parallel video data and derives a set of explicit video synchronization signals; the decoded parallel video data and synchronization signals are sent to the FPGA.
In the FPGA, the video-to-AXI-stream module first converts the video data into AXI video stream data, which has lower latency and is better suited to data synchronization and pipeline acceleration. The AXI-stream data then flows into the four-in-one module. The subsequent bilinear-interpolation despinning must read from DDR the four pixels adjacent to each sampling position, and repeated reads cause considerable latency. The four-in-one module therefore holds the data stream in the on-chip buffer each time two rows have arrived and merges the four adjacent 8-bit pixels around each pixel into one 32-bit word. When the four neighbors of a pixel are needed later, the merged 32-bit word is read once and split into four independent 8-bit values, so four pixels are obtained in a single read. This makes full use of the AXI bus bandwidth and reduces the read latency to one quarter of the original. The merged 32-bit video stream is then cached in the DDR on the ARM side through the AXI video stream DDR read/write module.
The dynamic stepless despinning module performs dynamic stepless despinning on the video data cached in DDR, according to the despinning command and despinning angle sent by the host computer through the RS422 serial communication module. During despinning it works together with the four-in-one module: each 32-bit word read from DDR is split into four 8-bit values for bilinear interpolation, and the processed image is stored back in DDR. The AXI video stream DDR read/write module then reads the cached despun image from DDR into an AXI video stream, the AXI-stream-to-video module converts it into parallel video data with explicit synchronization signals, and the result is sent to the video encoding module, encoded, and output to a display or capture card for real-time display.
The electronic image despinning algorithm based on bilinear interpolation is as follows:
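(1) According to the despinning angle sent by the host computer, compute, for each pixel (x', y') of the despun image, the corresponding coordinates (x, y) in the image before despinning. The formula is as follows:
$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x' \\ y' \end{bmatrix}$$
where $\theta$ is the rotation angle and
$$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
is the rotation matrix (the sign of $\theta$ follows the direction taken as positive).
The image center $(x_0, y_0)$ is generally taken as the center of rotation, so the above formula should be rewritten as:
$$\begin{bmatrix} x - x_0 \\ y - y_0 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x' - x_0 \\ y' - y_0 \end{bmatrix}$$
Writing the above formula in scalar form:
$$x = (x' - x_0)\cos\theta - (y' - y_0)\sin\theta + x_0$$
$$y = (x' - x_0)\sin\theta + (y' - y_0)\cos\theta + y_0$$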
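As an illustration of step (1), the following is a minimal C++ sketch of the coordinate mapping about the image center. It assumes the standard rotation matrix with positive angles counterclockwise in an x-right, y-up frame; the names `Point` and `source_coord` are hypothetical and are not taken from the invention's implementation.

```cpp
#include <cassert>
#include <cmath>

struct Point {
    double x;
    double y;
};

// Map a pixel (xo, yo) of the despun image back to source-image coordinates
// by rotating it about the center (cx, cy) by theta_deg degrees.
// Assumption: positive angles rotate counterclockwise (x right, y up).
Point source_coord(double xo, double yo, double cx, double cy, double theta_deg) {
    assert(std::isfinite(theta_deg));
    const double pi = std::acos(-1.0);
    const double t = theta_deg * pi / 180.0;  // degrees to radians
    const double c = std::cos(t);
    const double s = std::sin(t);
    const double dx = xo - cx;  // offset from the rotation center
    const double dy = yo - cy;
    return Point{dx * c - dy * s + cx, dx * s + dy * c + cy};
}
```

The returned coordinates are generally non-integer, which is why a resampling step is needed afterwards.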
(2) Perform pixel mapping by bilinear interpolation. The coordinates (x, y) in the original image computed in step (1) are usually not integers, so pixels cannot be mapped directly in a one-to-one relationship. The non-integer coordinates that arise during mapping are generally handled by resampling.
According to image reconstruction theory, three interpolation methods are commonly used for image mapping: nearest-neighbor interpolation, bilinear interpolation and cubic interpolation. Nearest-neighbor interpolation gives poor results: the despun image shows obvious jagged edges and burrs. Bilinear and cubic interpolation both give good results, with continuous gray levels and no jagged edges. Cubic interpolation, however, is algorithmically complex and too slow to meet real-time requirements in practical engineering. Weighing despinning precision against system real-time performance, the invention therefore adopts an image despinning algorithm based on bilinear interpolation.
A schematic diagram of the bilinear-interpolation despinning algorithm is shown in fig. 2. The method interpolates linearly in the x direction and the y direction from the gray values of the 4 integer-coordinate points surrounding the non-integer sampling point. In fig. 2, (x, y) is the coordinate of the sampling point to be interpolated, expressed as offsets within the unit cell (0 ≤ x, y ≤ 1), f(x, y) is the gray value at (x, y), and f(0,0), f(1,0), f(0,1) and f(1,1) are the gray values of the 4 points surrounding (x, y). The bilinear interpolation formula is then:
f(x,y)=[f(1,0)-f(0,0)]x+[f(0,1)-f(0,0)]y+[f(1,1)-f(1,0)-f(0,1)+f(0,0)]xy+f(0,0)
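The interpolation can be sketched in a few lines of C++. This is a minimal floating-point illustration of the standard bilinear form, in which the xy coefficient is f(1,1) − f(1,0) − f(0,1) + f(0,0); the function name `bilinear` is hypothetical, and a hardware implementation would more likely use fixed-point arithmetic.

```cpp
#include <cassert>

// Bilinear interpolation inside one unit cell.
// f00 = f(0,0), f10 = f(1,0), f01 = f(0,1), f11 = f(1,1);
// x and y are the fractional offsets of the sampling point, each in [0, 1].
double bilinear(double f00, double f10, double f01, double f11, double x, double y) {
    assert(x >= 0.0 && x <= 1.0 && y >= 0.0 && y <= 1.0);
    return (f10 - f00) * x
         + (f01 - f00) * y
         + (f11 - f10 - f01 + f00) * x * y
         + f00;
}
```

At the four corners the function returns the corner values themselves, and at the cell center it returns their average, which is a quick sanity check on the signs of the coefficients.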
(3) Determine the image boundary after despinning. The size of the image after rotation generally differs from that before rotation, so the image boundary must be determined again. Let (x1, y1), (x2, y2), (x3, y3) and (x4, y4) be the coordinates of the four image corners after rotation, with x increasing to the right and y increasing upward. The left, right, top and bottom boundary positions are then calculated as follows:
left=min(x1,x2,x3,x4)
right=max(x1,x2,x3,x4)
top=max(y1,y2,y3,y4)
bottom=min(y1,y2,y3,y4)
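A minimal C++ sketch of the boundary computation follows. It assumes coordinates measured from the image center, x increasing to the right and y increasing upward, so that the left boundary is the minimum x and the top boundary is the maximum y; `Bounds` and `rotated_bounds` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Bounds {
    double left;
    double right;
    double top;
    double bottom;
};

// Rotate the four corners of a width x height image about its center and
// return the extreme coordinates (assumption: x right, y up, origin at center).
Bounds rotated_bounds(double width, double height, double theta_deg) {
    assert(width > 0.0 && height > 0.0);
    const double pi = std::acos(-1.0);
    const double t = theta_deg * pi / 180.0;
    const double c = std::cos(t);
    const double s = std::sin(t);
    const double hx = width / 2.0;
    const double hy = height / 2.0;
    const double cx[4] = {-hx, hx, hx, -hx};  // corner x before rotation
    const double cy[4] = {-hy, -hy, hy, hy};  // corner y before rotation
    double xs[4];
    double ys[4];
    for (int i = 0; i < 4; ++i) {
        xs[i] = cx[i] * c - cy[i] * s;
        ys[i] = cx[i] * s + cy[i] * c;
    }
    Bounds b;
    b.left = *std::min_element(xs, xs + 4);
    b.right = *std::max_element(xs, xs + 4);
    b.top = *std::max_element(ys, ys + 4);
    b.bottom = *std::min_element(ys, ys + 4);
    return b;
}
```

In a frame where y increases downward, as is common for raster images, the roles of `top` and `bottom` would swap.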
(4) Fix the image resolution. In practical engineering the output resolution is usually fixed, whereas despinning the original video image by different angles inevitably changes the image size. The invention therefore crops the despun image about the image center so that the output resolution is fixed, that is, the output image always has the same size.
The invention focuses on realizing dynamic stepless despinning with high-level synthesis for large-scale integrated circuits. This is an important guarantee of real-time performance for a high-resolution system and is the most important innovation of the invention.
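The center crop can be illustrated with a short C++ sketch that computes the crop window for a fixed output size. The names `CropRect` and `center_crop` are hypothetical, and integer pixel dimensions with the origin at the top-left corner of the rotated image are assumed.

```cpp
#include <cassert>

struct CropRect {
    int x0;      // left edge of the crop window in the rotated image
    int y0;      // top edge of the crop window in the rotated image
    int width;   // fixed output width
    int height;  // fixed output height
};

// Return the window of size out_w x out_h centered in a src_w x src_h image.
CropRect center_crop(int src_w, int src_h, int out_w, int out_h) {
    assert(out_w > 0 && out_h > 0 && out_w <= src_w && out_h <= src_h);
    return CropRect{(src_w - out_w) / 2, (src_h - out_h) / 2, out_w, out_h};
}
```

For example, cropping a hypothetical 2203 × 2203 rotated canvas to 1920 × 1080 gives a window starting at (141, 561).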
Compared with the prior art, the invention has the advantages that:
(1) The invention designs a four-in-one module that makes full use of the high-bandwidth data stream. Each time two rows of the data stream have arrived they are held in the on-chip buffer, and the four 8-bit pixels around each pixel are merged into one 32-bit word and cached in DDR as a data stream. When a pixel is later despun by bilinear interpolation, the 32-bit word is fetched and split into four 8-bit pixels, which are exactly the four pixels required by the bilinear interpolation, so four pixels are read in a single access. The read latency of the algorithm is thereby reduced to one quarter of the original: the processing latency is the same as that of nearest-neighbor despinning, while the result is much better.
(2) Algorithm pipeline acceleration. Compared with a general embedded system, a great advantage of the FPGA is that the algorithm can be optimized as a data pipeline. The algorithm is therefore written in a pipelined style: during development in the Vivado HLS tool, the PIPELINE directive (the pipeline optimization pragma) is used, and the code follows the pipelined-programming rule that each datum is input once, used once and output once, which prevents the data flow from stalling. In this way the algorithm is pipelined at the cost of additional hardware logic resources.
Specifically, pipelining lets operations execute concurrently: each step need not wait for all operations of the previous iteration to finish before the next one starts. Pipelining applies to functions and loops. Taking loop pipelining as an example, each iteration involves three operations: read, compute and write. Without pipelining, the three operations execute serially, so an input is read once every 3 clock cycles and the value is output after 2 clock cycles. With pipelining, a read is issued every clock cycle and several data items are processed concurrently. The latency before and after pipeline optimization is shown in fig. 3: without pipelining, 3 clock cycles separate two reads and the last write executes after 8 clock cycles; with pipelining, 1 clock cycle separates two reads and the last write executes after 4 clock cycles. Pipelining thus raises data throughput, greatly reduces latency and improves the real-time performance of image despinning.
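The cycle counts above can be reproduced with a small timing model, sketched here in C++. The model assumes one clock cycle per operation, an initiation interval of 1 for the pipelined case, and three loop iterations; the iteration count is an assumption chosen because it reproduces the 8-cycle and 4-cycle figures. Both function names are hypothetical.

```cpp
#include <cassert>

// Clock cycles that elapse before the last write starts when every
// iteration runs its stages (read, compute, write) strictly in series.
int cycles_before_last_write_serial(int iterations, int stages) {
    assert(iterations > 0 && stages > 0);
    return iterations * stages - 1;
}

// Same quantity when a new iteration starts every `ii` cycles
// (ii is the initiation interval; ii = 1 means one read per clock).
int cycles_before_last_write_pipelined(int iterations, int stages, int ii) {
    assert(iterations > 0 && stages > 0 && ii > 0);
    return (iterations - 1) * ii + stages - 1;
}
```

With three iterations of three stages, the serial model gives 8 cycles and the pipelined model gives 4. For a long loop the pipelined count grows by one cycle per iteration instead of three, which is where the throughput gain comes from.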
(3) Real-time parallel use of multiple high-bandwidth AXI buses. The invention addresses real-time despinning of high-resolution images. The on-chip buffer (BRAM) of the FPGA is limited and cannot hold a whole frame of a high-resolution image, so a 64-bit 128MB DDR is attached externally on the ARM side for image caching. Unlike caching directly in BRAM, the FPGA must read and write this DDR on the ARM side through the AXI bus. Analysis and measurement of the latency show that, because the algorithm has already been pipeline optimized as described in (2) and the latency of the despinning algorithm itself is low, the latency comes mainly from reading and writing DDR over the AXI bus. The FPGA + ARM chip used by the invention, the Zynq UltraScale+ MPSoC 15EG, has abundant AXI bus resources (seven 128-bit AXI buses). The invention therefore uses several high-bandwidth AXI buses in parallel to read, write and process multiple pixels simultaneously, which greatly reduces latency, increases data throughput and improves the real-time performance of the algorithm. In the final design, two 128-bit buses and one 64-bit bus are used in parallel. For a 1080p grayscale image, the total latency of the bilinear-interpolation despinning algorithm over the full 360° range is 12 ms, so for both 30 fps and 60 fps video the despinning is completed within one frame period, that is, real-time despinning of high-resolution images is achieved.
Moreover, despinning of 1080p images occupies only 36% of the bus resources, so the resolution that can be despun in real time can be raised further by using more of the buses.
(4) Algorithm design optimization and resource scheduling using high-level synthesis. The Zynq UltraScale+ MPSoC 15EG chip used by the invention is a heterogeneous embedded chip from Xilinx and is developed with the Vivado development suite, which includes the high-level synthesis tool Vivado HLS. Under the HLS framework, the algorithm can be developed and optimized in a high-level language (C/C++/SystemC) following the tool's coding rules, and the HLS tool then converts the high-level program into a hardware description language (Verilog HDL/VHDL) program. Developing with a high-level synthesis tool makes algorithm design optimization and scheduling of logic resources convenient, greatly improves development efficiency, fully exploits the parallel computing capability of the multiple AXI buses of the FPGA + ARM architecture and the acceleration provided by multiple pipelines, and markedly improves the performance of the despinning algorithm. The design is balanced across logic resource usage, latency and throughput; because the chip used has ample logic resources, logic resources are traded for lower algorithm latency and higher data throughput. The invention makes full use of HLS and improves the despinning algorithm through data type optimization and data throughput optimization.
Specifically, for data type optimization, 20-bit data are used many times in the algorithm. The bit widths of standard C data types are integer multiples of 8 bits, so using 32-bit integers directly would waste logic resources and fail to exploit the high performance and strong parallelism of the FPGA. The invention therefore defines a 20-bit data type using the arbitrary-bit-width data types provided by the HLS tool, which greatly reduces logic resource usage. For data throughput optimization, following the idea of trading area for speed, loops are pipelined and unrolled, which raises the throughput and performance of the algorithm at the cost of logic resources.
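The following C++ sketch illustrates why a 20-bit width can be sufficient where a 32-bit integer would be wasteful. It is a plain-C++ stand-in only: in Vivado HLS the arbitrary-precision type `ap_uint<20>` from `ap_int.h` would play the role of the bit-field used here. The example quantity, an 8-bit pixel multiplied by a 12-bit interpolation weight, is a hypothetical case and is not stated in the source; `Uint20`, `bits_needed` and `weighted_pixel` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

// Plain-C++ stand-in for a 20-bit unsigned value. In Vivado HLS the
// arbitrary-precision type ap_uint<20> (ap_int.h) would be used instead.
struct Uint20 {
    std::uint32_t v : 20;
};

// Number of bits needed to represent the unsigned value v (at least 1).
int bits_needed(std::uint32_t v) {
    int n = 1;
    while (v >>= 1) {
        ++n;
    }
    return n;
}

// Hypothetical use: an 8-bit pixel times a 12-bit interpolation weight.
// The largest product, 255 * 4095 = 1044225, is below 2^20 = 1048576.
Uint20 weighted_pixel(std::uint8_t pixel, std::uint16_t weight12) {
    assert(weight12 < 4096);
    Uint20 r;
    r.v = static_cast<std::uint32_t>(pixel) * weight12;
    return r;
}
```

Under these assumptions the product never needs more than 20 bits, so the remaining 12 bits of a 32-bit register would never carry information.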
(5) Practical tests show that real-time despinning is achieved for 1920 × 1080 visible-light images, with a despinning range of 0° to 360°, a latency below 12 ms, a despinning angle precision of 0.001° and a maximum pixel error of less than 1 pixel. The system as a whole offers high video resolution, a large despinning range, high despinning precision, clear processed images without jagged edges, low output latency, high stability, ease of processing, low power consumption and small size.
Detailed Description
The following further describes the embodiments of the present invention with reference to the drawings.
As shown in fig. 1, the despinning system of the invention comprises a video acquisition module, a video decoding module, a core processing module and a video encoding module. The core processing module is a heterogeneous system-on-chip with an FPGA + ARM architecture. The FPGA comprises a dynamic stepless despinning module, a video-to-AXI-stream module, an AXI video stream DDR read/write module, and a pixel merging module (the four-in-one module) newly designed to reduce algorithm latency and improve bus bandwidth utilization. The ARM side comprises a video storage module (DDR) and an RS422 serial communication module, and the FPGA and the ARM communicate over an AXI control bus.
The video acquisition module is an industrial camera with a resolution of 1920 × 1080 and a frame rate of 30 Hz or 60 Hz; the video output format is not restricted. The video decoding module is a video decoding chip that converts the input serial video signal into parallel-format video together with a data enable signal DE, a line synchronization signal HSYNC and a field synchronization signal VSYNC, and passes the data, enable and synchronization signals to the FPGA for subsequent processing. The video storage module combines four 16-bit 128MB DDR4 devices into a 64-bit 128MB DDR. Despinning requires a whole frame to be cached, and the on-chip buffer of the FPGA is too small to hold a whole frame, so external memory is needed; the invention places the DDR on the ARM side of the Zynq chip, which is more convenient for subsequent operations. The data communication module has two parts. The first is the communication between the electronic despinning system and the host computer, which is based on RS422; this stable low-speed protocol is sufficient for transmitting the despinning angle. The second is the communication between the FPGA side and the ARM side inside the Zynq chip, which uses the AXI bus protocol provided by Xilinx to carry command information and image information. The video encoding module is a video encoding chip that converts the parallel video data, the data enable signal DE, the line synchronization signal HSYNC and the field synchronization signal VSYNC into a serial video signal, which is output to a display or capture card for real-time display.
The core processing module of the system is a Zynq UltraScale+ MPSoC 15EG. A chip of the Zynq architecture fully exploits both the parallel acceleration of the FPGA side and the master-control scheduling of the ARM side, and is one of the mainstream heterogeneous system-on-chip devices. The core of the invention is the four-in-one module and the dynamic stepless despinning module: the despinning algorithm is deployed on the FPGA side, while memory scheduling and communication with the host computer are handled on the ARM side.
The invention specifically comprises the following steps:
Step one: video capture and decoding
The invention uses an industrial camera to capture video images and a decoding chip to decode them into parallel video, a data enable signal DE, a line synchronization signal HSYNC and a field synchronization signal VSYNC. Because the design is based on FPGA AXI data streams, the decoded signals are sent to the video-to-AXI-stream module, which converts the parallel video data into AXI video stream data so that pipeline acceleration can be applied efficiently in the later stages.
Step two: adjacent pixel merging
The invention designs a four-in-one module. Each time two rows of the data stream have arrived they are held in the on-chip buffer, and the four adjacent 8-bit pixels around each pixel are merged into one 32-bit word. When a pixel is later despun by bilinear interpolation, the 32-bit word is fetched and split into four 8-bit pixels, which are exactly the four pixels required by the bilinear interpolation. Four pixels are thus read in a single access, and the read latency of the algorithm is reduced to one quarter of the original.
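The merging and splitting described in step two can be sketched in plain C++ as follows. The byte order inside the 32-bit word (top-left pixel in the lowest byte) is an assumption, as are the names `pack4`, `unpack4` and `merge_rows`; the source does not specify the layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Merge the 2x2 neighborhood p00 p10 / p01 p11 into one 32-bit word.
// Assumed layout: p00 in bits 7..0, p10 in 15..8, p01 in 23..16, p11 in 31..24.
std::uint32_t pack4(std::uint8_t p00, std::uint8_t p10,
                    std::uint8_t p01, std::uint8_t p11) {
    return static_cast<std::uint32_t>(p00) |
           (static_cast<std::uint32_t>(p10) << 8) |
           (static_cast<std::uint32_t>(p01) << 16) |
           (static_cast<std::uint32_t>(p11) << 24);
}

// Split a merged word back into its four 8-bit pixels, in the same order.
std::array<std::uint8_t, 4> unpack4(std::uint32_t w) {
    return {{static_cast<std::uint8_t>(w & 0xFFu),
             static_cast<std::uint8_t>((w >> 8) & 0xFFu),
             static_cast<std::uint8_t>((w >> 16) & 0xFFu),
             static_cast<std::uint8_t>((w >> 24) & 0xFFu)}};
}

// Merge two buffered rows: word x holds pixels (x, y), (x+1, y),
// (x, y+1) and (x+1, y+1), so one read later yields all four neighbors.
std::vector<std::uint32_t> merge_rows(const std::vector<std::uint8_t>& row0,
                                      const std::vector<std::uint8_t>& row1) {
    assert(row0.size() == row1.size() && row0.size() >= 2);
    std::vector<std::uint32_t> out;
    out.reserve(row0.size() - 1);
    for (std::size_t x = 0; x + 1 < row0.size(); ++x) {
        out.push_back(pack4(row0[x], row0[x + 1], row1[x], row1[x + 1]));
    }
    return out;
}
```

The design trades memory for latency: each stored word is four times the size of a pixel, but the interpolation stage issues one DDR read instead of four.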
Step three: video data storage
The 32-bit video stream data merged in step two is cached in the DDR on the ARM side through the AXI video stream DDR read/write module.
Step four: real-time dynamic stepless despinning of video data
The flow chart of the dynamic stepless despinning module is shown in fig. 4. The invention designs the dynamic stepless despinning algorithm with Vivado high-level synthesis and packages it as an IP core. The IP core defines two m_axi (AXI master) ports, used respectively for reading and writing the DDR4. The m_axi read port reads the original pixel information from one frame buffer of the DDR4 over the AXI bus; after dynamic stepless despinning by the despinning algorithm, the m_axi write port outputs the processed pixel information to another frame buffer of the DDR, which completes the image despinning process.
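The shape of such an IP core can be sketched as a C++ top-level function with Vivado HLS interface pragmas. This is a hedged sketch, not the invention's source code: the function name `despin_top`, the bundle names and the pragma options are assumptions, and the loop body is only a placeholder that averages the four packed pixels (bilinear interpolation at the cell center) in place of the full angle-dependent interpolation. A standard C++ compiler ignores the HLS pragmas, so the function can also be run on a host for testing.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical top-level function for an HLS IP core with two AXI master
// ports: one reads merged 32-bit words from a DDR frame buffer, the other
// writes 8-bit result pixels to a second frame buffer.
void despin_top(const std::uint32_t* src, std::uint8_t* dst, int n) {
#pragma HLS INTERFACE m_axi port=src offset=slave bundle=gmem_rd
#pragma HLS INTERFACE m_axi port=dst offset=slave bundle=gmem_wr
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        const std::uint32_t w = src[i];  // one read returns four neighbors
        const std::uint32_t p00 = w & 0xFFu;
        const std::uint32_t p10 = (w >> 8) & 0xFFu;
        const std::uint32_t p01 = (w >> 16) & 0xFFu;
        const std::uint32_t p11 = (w >> 24) & 0xFFu;
        // Placeholder computation: rounded average of the four pixels.
        dst[i] = static_cast<std::uint8_t>((p00 + p10 + p01 + p11 + 2u) / 4u);
    }
}
```

Placing the read and write ports in separate bundles lets the two frame buffers be accessed over different AXI buses, in line with the multi-bus parallelism described earlier.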
Step five: video encoding and output display
After the despinning in step four, the despun image is held in a buffer area of the DDR. The AXI video stream DDR read/write module reads the cached despun video image from DDR into an AXI video stream, the AXI-stream-to-video module converts it into parallel video data with explicit synchronization signals, and the result is sent to the video encoding chip, encoded, and output to a monitor or capture card so that the despinning result is displayed in real time.
With the above steps, the host computer can specify any despinning angle and the system outputs the despun result in real time. For example, the host computer commands a clockwise despinning angle of 0.625°; the images before and after processing by the despinning system are shown in fig. 5. Fig. 5 (a) is the original image before despinning. The image is tilted relative to the horizontal, that is, the optical axis is not accurately leveled and there is a counterclockwise rotation, measured by the host computer as 0.625°. The host computer therefore sends a despinning angle of 0.625° to the despinning system. As shown in fig. 5 (b), the despun image is level in the horizontal direction and shows no jagged edges; the despinning angle precision reaches 0.001°, and the frame is processed in less than 12 ms, which gives high real-time performance.
Details not described in the present specification are prior art known to those skilled in the art.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.