
CN111932436B - Deep learning processor architecture for intelligent parking - Google Patents


Info

Publication number
CN111932436B
CN111932436B (application CN202010862272.8A; published as CN111932436A)
Authority
CN
China
Prior art keywords
memory
deep learning
data
module
inner product
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010862272.8A
Other languages
Chinese (zh)
Other versions
CN111932436A (en)
Inventor
王铭宇
王堃
吴晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Star Innovation Technology Co ltd
Original Assignee
Chengdu Star Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Star Innovation Technology Co ltd filed Critical Chengdu Star Innovation Technology Co ltd
Priority to CN202010862272.8A
Publication of CN111932436A
Application granted
Publication of CN111932436B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning processor architecture for intelligent parking, comprising a high-speed data interface module, a DMA module, a synchronous control module, a deep learning network acceleration module and a memory controller. The deep learning network acceleration module performs data processing and implements each deep learning network used by the parking system; it comprises an external memory reading module, an input feature map memory, a kernel memory, an instruction controller, a data reading module, an inner product accelerator, an output controller, an output feature map memory and an external memory writing module. The input feature map memory and the kernel memory both adopt an A/B dual-memory mode. The invention improves computational efficiency, serves as a hardware accelerator for the multiple deep learning networks of a high-level video system, and achieves high computing power and low power consumption.

Description

Deep learning processor architecture for intelligent parking
Technical Field
The invention relates to the field of hardware architecture, in particular to a deep learning processor architecture for intelligent parking.
Background
The high-level video intelligent roadside parking system (hereinafter referred to as the high-level video system) is a system based on computer vision and deep artificial intelligence technology. Compared with traditional intelligent parking systems such as geomagnetic induction, it has the obvious advantages of high accuracy and the ability to identify license plates, vehicle characteristics and the like. An existing high-level video system generally uses a high-definition camera and an edge processing unit to monitor and identify roadside parking spaces in real time; when a vehicle enters or leaves, information such as the license plate and the parking time is collected, analyzed and stored. Recognition of roadside parking generally requires multiple deep learning networks to identify parking space occupancy, vehicle characteristics, license plates and the like, and the networks differ in size and computation time.
Existing high-level video systems generally use a GPU together with a CPU as the processing units. The GPU, as a general-purpose processor for artificial intelligence, offers strong versatility and ease of use, but its computing power is low and its power consumption large, so it cannot meet the growing functional and performance requirements of high-level video systems. There is therefore a need for a deep learning processor architecture for smart parking that replaces the GPU and serves as a hardware accelerator for the multiple deep learning networks of a high-level video system.
Disclosure of Invention
The invention aims to provide a deep learning processor architecture for intelligent parking that replaces a GPU, interacts with a CPU, and serves as a hardware accelerator for the various deep learning networks used by a high-level video system, thereby achieving high computing power, low power consumption, high real-time performance and high accuracy.
The technical scheme adopted by the invention is as follows:
the invention relates to a deep learning processor architecture for intelligent parking, which comprises a high-speed data interface module, a DMA module, a synchronous control module, a deep learning network acceleration module and a memory controller,
The high-speed data interface module is used for connecting external equipment and carrying out data interaction;
The synchronous control module comprises synchronous control, a sending data buffer, a receiving data buffer and a receiving address buffer;
The deep learning network acceleration module is used for data processing and realizing each deep learning network used by the parking system; the deep learning network acceleration module comprises an external memory reading module, an input feature map memory, a kernel memory, an instruction controller, a data reading module, an inner product accelerator, an output controller, an output feature map memory and an external memory writing module; the input feature map memory, the kernel memory and the instruction controller are connected to an external memory through the external memory reading module to read data; the input feature map memory and the kernel memory both adopt an A/B dual-memory mode and are connected to the data reading module, which conveys their data to the inner product accelerator; the instruction controller is connected to the inner product accelerator; the output controller is connected to the inner product accelerator for data interaction, the output feature map memory is connected to the inner product accelerator, and the data processing result of the inner product accelerator is sent through the output feature map memory to the external memory writing module and finally stored in the external memory;
and the memory controller performs data interaction with the deep learning network acceleration module; intermediate data is stored into the external memory through the memory controller and can be read back in from the external memory.
Furthermore, the parking system uses three deep learning networks, namely a deep learning vehicle recognition network, a deep learning license plate recognition network, and a deep learning license plate number and character recognition network.
Furthermore, the deep learning vehicle recognition network, the deep learning license plate recognition network, and the deep learning license plate number and character recognition network are all decomposed into reusable substructures, and hardware acceleration of the three deep learning networks is completed through combined calls from a special instruction set.
Furthermore, the kernel memory can also store weight data and bias data.
Further, the high-speed data interface module comprises a PCIe or USB3 high-speed data communication interface.
The invention comprises a high-speed data interface module, a DMA module, a synchronous control module, a deep learning network acceleration module and a memory controller, with the other four functional modules serving as auxiliary modules of the deep learning network acceleration module to realize data interaction. The high-speed data interface module connects to external equipment; external data is transmitted through the DMA module and then through the synchronous control module to the deep learning network acceleration module. After the deep learning network acceleration module processes the data, some intermediate data is stored into the external memory through the memory controller and is read back from the external memory when required. After the deep learning network acceleration module finishes data processing, the data is output to the external equipment through the synchronous control module, the DMA module and the high-speed data communication interface.
The deep learning network acceleration module is the core module of the invention and comprises an external memory reading module, an input feature map memory, a kernel memory, an instruction controller, a data reading module, an inner product accelerator, an output controller, an output feature map memory and an external memory writing module. When the deep learning network acceleration module needs to process feature map and kernel data, the input feature maps and kernel data are read from the external memory through the external memory reading module and stored in the corresponding input feature map memory and kernel memory. The inner product accelerator is the core computing unit of the deep learning network acceleration module; it has many parallel processing units, and its processing speed exceeds the speed at which the input feature map memory and the kernel memory can read data from the external memory. Both memories therefore adopt an A/B dual-memory mode, i.e. a ping-pong dual-memory mode: while the inner product accelerator reads kernel memory A through the data reading module, kernel memory B reads data from the external memory through the external memory reading module; when the inner product accelerator finishes its data interaction with kernel memory A, it immediately begins data interaction with kernel memory B, and kernel memory A immediately begins reading the next batch of data. Data is conveyed to the inner product accelerator through the data reading module; meanwhile, according to the input instructions of the instruction controller, the inner product accelerator can complete various calculations. Some intermediate data of the inner product accelerator can be sent to the output controller for secondary processing and returned to the inner product accelerator. After the inner product accelerator completes its calculations, the result is sent through the output feature map memory to the external memory writing module and finally stored in the external memory.
In summary, by adopting the technical scheme, the invention has the beneficial effects that:
1. The deep learning processor architecture for intelligent parking of the invention replaces a GPU (graphics processing unit), interacts with a CPU (central processing unit), uses an inner product accelerator as the core computing unit of the deep learning network acceleration module, and reads data into the A/B dual (ping-pong) input feature map memory and kernel memory, which improves computing efficiency. Serving as a hardware accelerator for the multiple deep learning networks of a high-level video system, it achieves high computing power, low power consumption, high real-time performance and high accuracy.
2. With the deep learning processor architecture for intelligent parking of the invention, power consumption, calculation speed, calculation latency and circuit size are balanced and optimized when a deep learning network is processed by the processor architecture, so that the high-level video intelligent roadside parking system operates in an optimal state.
Drawings
For a clearer description of the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should not be considered limiting of its scope; for those skilled in the art, other related drawings may be obtained from these drawings without creative effort. The proportional relationships of the components in the drawings do not represent the proportional relationships of an actual design; the drawings are merely schematic diagrams of structure or position.
FIG. 1 is a block diagram of the present invention;
Fig. 2 is a block diagram of a deep learning network acceleration module.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the particular embodiments described herein are illustrative only and are not intended to limit the invention, i.e., the embodiments described are merely some, but not all, of the embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
It is noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
All of the features disclosed in this specification, or all of the steps in a method or process disclosed, may be combined in any combination, except for mutually exclusive features and/or steps.
The present invention will be described in detail with reference to the accompanying drawings.
Example 1
As shown in fig. 1-2, the present invention is a deep learning processor architecture for smart parking, comprising a high-speed data interface module, a DMA module, a synchronization control module, a deep learning network acceleration module and a memory controller,
The high-speed data interface module is used for connecting external equipment and carrying out data interaction;
The synchronous control module comprises synchronous control, a sending data buffer, a receiving data buffer and a receiving address buffer;
The deep learning network acceleration module is used for data processing and realizing each deep learning network used by the parking system; the deep learning network acceleration module comprises an external memory reading module, an input feature map memory, a kernel memory, an instruction controller, a data reading module, an inner product accelerator, an output controller, an output feature map memory and an external memory writing module; the input feature map memory, the kernel memory and the instruction controller are connected to an external memory through the external memory reading module to read data; the input feature map memory and the kernel memory both adopt an A/B dual-memory mode and are connected to the data reading module, which conveys their data to the inner product accelerator; the instruction controller is connected to the inner product accelerator; the output controller is connected to the inner product accelerator for data interaction, the output feature map memory is connected to the inner product accelerator, and the data processing result of the inner product accelerator is sent through the output feature map memory to the external memory writing module and finally stored in the external memory;
and the memory controller performs data interaction with the deep learning network acceleration module; intermediate data is stored into the external memory through the memory controller and can be read back in from the external memory.
In the invention, the processor architecture is implemented on an FPGA chip; a Kintex-7 series FPGA chip is selected in this embodiment. The invention comprises a high-speed data interface module, a DMA module, a synchronous control module, a deep learning network acceleration module and a memory controller, with the other four functional modules serving as auxiliary modules of the deep learning network acceleration module to realize data interaction. The invention connects to external equipment through the high-speed data interface module; external data passes through the DMA module (direct memory access module) and is then sent through the synchronous control module to the deep learning network acceleration module. After the deep learning network acceleration module processes the data, some intermediate data is stored into the external memory through the memory controller and is read back from the external memory when needed. After the deep learning network acceleration module finishes data processing, the data is output to the external equipment through the synchronous control module, the DMA module and the high-speed data communication interface.
The deep learning network acceleration module is the core module of the invention and comprises an external memory reading module, an input feature map memory, a kernel memory, an instruction controller, a data reading module, an inner product accelerator, an output controller, an output feature map memory and an external memory writing module. When the deep learning network acceleration module needs to process feature map and kernel data, the input feature maps and kernel data are read from the external memory through the external memory reading module and stored in the corresponding input feature map memory and kernel memory. The inner product accelerator is the core computing unit of the deep learning network acceleration module; it has many parallel processing units, and its processing speed exceeds the speed at which the input feature map memory and the kernel memory can read data from the external memory. Both memories therefore adopt an A/B dual-memory mode, i.e. a ping-pong dual-memory mode: while the inner product accelerator reads kernel memory A through the data reading module, kernel memory B reads data from the external memory through the external memory reading module; when the inner product accelerator finishes its data interaction with kernel memory A, it immediately begins data interaction with kernel memory B, and kernel memory A immediately begins reading the next batch of data, which improves data processing efficiency. Data is conveyed to the inner product accelerator through the data reading module; meanwhile, according to the input instructions of the instruction controller, the inner product accelerator can complete various calculations. Some intermediate data of the inner product accelerator can be sent to the output controller for secondary processing and returned to the inner product accelerator. After the inner product accelerator completes its calculations, the result is sent through the output feature map memory to the external memory writing module and finally stored in the external memory.
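The A/B dual-memory (ping-pong) operation described above can be illustrated with a short software sketch. This is a minimal model under assumptions: the bank size, the function names load_from_external and compute_inner_products, and the batch count are invented for the example, and the load and compute steps that run concurrently in hardware are serialized here by the loop.

    #include <stdint.h>
    #include <stdio.h>

    #define KERNEL_WORDS 1024   /* assumed capacity of one kernel memory bank */

    /* Two banks: while the IPA consumes one bank, the external-memory
     * reading module refills the other; the roles swap every batch. */
    static int8_t kernel_mem[2][KERNEL_WORDS];

    /* Stand-in for the external memory reading module. */
    static void load_from_external(int8_t *dst, int batch)
    {
        for (int i = 0; i < KERNEL_WORDS; i++)
            dst[i] = (int8_t)(batch + i);           /* dummy kernel data */
    }

    /* Stand-in for the inner product accelerator draining a bank. */
    static void compute_inner_products(const int8_t *kernels, int batch)
    {
        long acc = 0;
        for (int i = 0; i < KERNEL_WORDS; i++)
            acc += kernels[i];
        printf("batch %d consumed, checksum %ld\n", batch, acc);
    }

    int main(void)
    {
        int active = 0;
        load_from_external(kernel_mem[active], 0);  /* prime bank A */

        for (int batch = 0; batch < 4; batch++) {
            int shadow = 1 - active;
            /* In hardware the refill and the compute run concurrently;
             * here they are serialized only for illustration. */
            load_from_external(kernel_mem[shadow], batch + 1);
            compute_inner_products(kernel_mem[active], batch);
            active = shadow;                        /* ping-pong swap */
        }
        return 0;
    }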
The inner product accelerator, the core computing unit of the deep learning network acceleration module, is denoted IPA and is mainly used to complete parallel multiply-accumulate computation. The inner product accelerator receives a feature map vector of length 512 × 8 bits and a kernel vector of length 1024 × 8 bits and outputs their inner products, which can be 64-, 32-, 16- or 2-bit fixed-point numbers according to the instruction configuration of the instruction controller.
The inner product accelerator contains 32 processing units; each processing unit is composed of 16 multiplication units, and each multiplication unit is one DSP unit of the FPGA that can receive two groups of data and perform two multiplications respectively. We therefore refer to the two halves of the IPA as IPA1 and IPA2 and inject different data into each. The method is as follows:
A DSP48E in the FPGA can perform a calculation of the form P = (D ± A) × B ± C, and the DSP48E is configured as P = (D + A) × B + C. Two 16-bit multiplications, P1 = A × K1 and P2 = A × K2, need to be calculated; they are replaced by a single 32-bit multiplication using the substitution P = A × (K1 << 16 + K2). A 32-bit output P is obtained in which the upper 16 bits are P1 and the lower 16 bits are P2. C is configured based on the low product: C = 0 if P2 ≥ 0, otherwise C = 1 << 16.
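As a sanity check of this dual-multiplication scheme, the following sketch reproduces the substitution P = A × (K1 << 16 + K2) + C in plain C and verifies it exhaustively for signed 8-bit operands (an assumption consistent with the 8-bit feature map and kernel vectors above, so each product fits in 16 bits). It models the arithmetic only, not the DSP48E configuration itself.

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Emulates one DSP48E computing two products at once: B packs the two
     * kernel values as (K1 << 16) + K2, and C = 1 << 16 compensates the
     * borrow into the upper half when the low product is negative. */
    int main(void)
    {
        for (int a = -128; a <= 127; a++)
            for (int k1 = -128; k1 <= 127; k1++)
                for (int k2 = -128; k2 <= 127; k2++) {
                    int32_t p1 = a * k1;               /* expected high product */
                    int32_t p2 = a * k2;               /* expected low product  */

                    int64_t b = (int64_t)k1 * 65536 + k2; /* (K1 << 16) + K2    */
                    int64_t c = (p2 < 0) ? (1 << 16) : 0; /* sign correction    */
                    int64_t p = (int64_t)a * b + c;       /* one wide multiply  */

                    int16_t lo = (int16_t)(p & 0xFFFF);   /* lower 16 bits      */
                    int32_t hi = (int32_t)(p >> 16);      /* upper bits         */
                    assert(lo == p2 && hi == p1);
                }

        puts("all 2^24 operand combinations unpack correctly");
        return 0;
    }

In hardware, C can be set before the multiply because the sign of P2 follows from the signs of A and K2; the sketch computes P2 directly only to drive the check.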
The external memory reading module reads data from the external memory, and the reading is controlled by a state machine with an idle state and four read states: read feature map, read kernel, read bias, and read instruction.
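A minimal software rendering of that state machine is sketched below. The state names follow the description; the transition order (feature map, kernel, bias, instruction, then back to idle) is an assumption about the original controller.

    #include <stdio.h>

    /* One idle state plus the four read states named in the description. */
    typedef enum {
        S_IDLE,
        S_READ_FEATURE_MAP,
        S_READ_KERNEL,
        S_READ_BIAS,
        S_READ_INSTRUCTION
    } ReadState;

    /* Assumed transition order; the real controller may branch differently. */
    static ReadState next_state(ReadState s, int start_request)
    {
        switch (s) {
        case S_IDLE:             return start_request ? S_READ_FEATURE_MAP : S_IDLE;
        case S_READ_FEATURE_MAP: return S_READ_KERNEL;
        case S_READ_KERNEL:      return S_READ_BIAS;
        case S_READ_BIAS:        return S_READ_INSTRUCTION;
        case S_READ_INSTRUCTION: return S_IDLE;
        }
        return S_IDLE;
    }

    int main(void)
    {
        static const char *names[] = { "idle", "read feature map", "read kernel",
                                       "read bias", "read instruction" };
        ReadState s = S_IDLE;
        for (int cycle = 0; cycle < 6; cycle++) {
            printf("cycle %d: %s\n", cycle, names[s]);
            s = next_state(s, /*start_request=*/cycle == 0);
        }
        return 0;
    }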
The external memory writing module performs, for each convolutional layer, preload and post-processing steps such as activation and pooling operations, and then writes the data to the external memory.
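A compact sketch of such post-processing follows, assuming a ReLU activation and 2×2 max pooling with stride 2; the description names the operations but not their exact variants, so both choices are assumptions.

    #include <stdint.h>
    #include <stdio.h>

    /* ReLU applied element-wise before write-back (assumed variant). */
    static int8_t relu(int8_t x) { return x > 0 ? x : 0; }

    /* 2x2 max pooling with stride 2 over an h x w feature map (assumed). */
    static void relu_maxpool2x2(const int8_t *in, int8_t *out, int h, int w)
    {
        for (int y = 0; y < h / 2; y++)
            for (int x = 0; x < w / 2; x++) {
                int8_t m = -128;
                for (int dy = 0; dy < 2; dy++)
                    for (int dx = 0; dx < 2; dx++) {
                        int8_t v = relu(in[(2 * y + dy) * w + (2 * x + dx)]);
                        if (v > m) m = v;
                    }
                out[y * (w / 2) + x] = m;
            }
    }

    int main(void)
    {
        int8_t fmap[4 * 4] = { -3,  5,  2, -1,
                                7, -2,  0,  4,
                                1,  1, -6,  8,
                               -9,  3,  2,  2 };
        int8_t pooled[2 * 2];
        relu_maxpool2x2(fmap, pooled, 4, 4);
        for (int i = 0; i < 4; i++) printf("%d ", pooled[i]);
        printf("\n");   /* expected: 7 4 3 8 */
        return 0;
    }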
The output controller calculates the final output of each convolutional layer from the output of the inner product accelerator (IPA) and completes data truncation.
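Data truncation here typically means requantizing the wide accumulator output of the IPA back to the 8-bit feature map format. The patent does not give the exact scheme, so the round-shift-saturate sketch below is only one plausible implementation, with the shift amount treated as a per-layer parameter from the instruction configuration.

    #include <stdint.h>
    #include <stdio.h>

    /* Requantize a wide accumulator value to int8: round by adding half an
     * LSB before an arithmetic right shift, then saturate to [-128, 127].
     * The shift amount would come from the layer's instruction configuration. */
    static int8_t truncate_output(int32_t acc, int shift)
    {
        int32_t rounded = (acc + (1 << (shift - 1))) >> shift;
        if (rounded >  127) return  127;
        if (rounded < -128) return -128;
        return (int8_t)rounded;
    }

    int main(void)
    {
        int32_t samples[] = { 51200, -51200, 123456, -123456, 37 };
        for (int i = 0; i < 5; i++)
            printf("%8d -> %4d\n", samples[i], truncate_output(samples[i], 8));
        return 0;
    }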
Example two
This example is a further illustration of the present invention.
Based on the above embodiment, the parking system uses three deep learning networks: a deep learning vehicle recognition network, a deep learning license plate recognition network, and a deep learning license plate number and character recognition network. The deep learning vehicle recognition network is an object recognition network used for vehicle recognition, the deep learning license plate recognition network is a miniature object recognition network used for license plate recognition, and the deep learning license plate number and character recognition network is a small object recognition network used for license plate number and character recognition.
Furthermore, the deep learning vehicle recognition network, the deep learning license plate recognition network, and the deep learning license plate number and character recognition network are all decomposed into reusable substructures, and hardware acceleration of the three deep learning networks is completed through combined calls from a special instruction set, as sketched below. This saves chip circuit area and improves the forward compatibility of the chip design, so that the chip can, as far as possible, meet the requirements of new artificial intelligence networks that may appear in the future.
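To make the substructure-reuse idea concrete, the sketch dispatches a small hypothetical instruction set over shared substructures; the opcodes and the two example network programs are illustrative assumptions, not the patent's actual special instruction set.

    #include <stdio.h>

    /* Hypothetical opcodes, each mapping to one reusable hardware substructure. */
    typedef enum { OP_CONV, OP_POOL, OP_ACTIVATE, OP_FC, OP_END } Op;

    static void dispatch(Op op)
    {
        switch (op) {
        case OP_CONV:     puts("  convolution substructure");     break;
        case OP_POOL:     puts("  pooling substructure");         break;
        case OP_ACTIVATE: puts("  activation substructure");      break;
        case OP_FC:       puts("  fully-connected substructure"); break;
        case OP_END:      break;
        }
    }

    int main(void)
    {
        /* Each network is just a different instruction sequence over the
         * same substructures, so one accelerator serves all the networks. */
        Op vehicle_net[] = { OP_CONV, OP_ACTIVATE, OP_POOL, OP_CONV,
                             OP_ACTIVATE, OP_FC, OP_END };
        Op plate_net[]   = { OP_CONV, OP_ACTIVATE, OP_CONV, OP_ACTIVATE,
                             OP_FC, OP_END };

        puts("vehicle recognition network:");
        for (int i = 0; vehicle_net[i] != OP_END; i++) dispatch(vehicle_net[i]);

        puts("license plate recognition network:");
        for (int i = 0; plate_net[i] != OP_END; i++) dispatch(plate_net[i]);
        return 0;
    }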
Example III
This example is a further illustration of the present invention.
This embodiment is based on the above embodiments; here the kernel memory can also store weight data and bias data. The external memory reading module reads the feature map, kernel, weight data and bias data from the external memory; the feature map is stored in the input feature map memory, while the kernel, weight data and bias data are stored in the kernel memory.
Example IV
This example is a further illustration of the present invention.
In this embodiment, based on the foregoing embodiments, the high-speed data interface module includes a PCIe or USB3 high-speed data communication interface. The corresponding external device, such as a computer, should have a matching PCIe or USB3 data interface.
The above description covers only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any change or substitution that can be readily conceived by those skilled in the art within the technical scope disclosed by the present invention shall be included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope defined by the claims.

Claims (5)

1. A deep learning processor architecture for smart parking, characterized by: comprises a high-speed data interface module, a DMA module, a synchronous control module, a deep learning network acceleration module and a memory controller,
The high-speed data interface module is used for connecting external equipment and carrying out data interaction;
The synchronous control module comprises synchronous control, a sending data buffer, a receiving data buffer and a receiving address buffer;
The deep learning network acceleration module is used for data processing and realizing each deep learning network used by the parking system; the deep learning network acceleration module comprises an external memory reading module, an input feature map memory, a kernel memory, an instruction controller, a data reading module, an inner product accelerator, an output controller, an output feature map memory and an external memory writing module; the input feature map memory, the kernel memory and the instruction controller are connected to an external memory through the external memory reading module to read data; the input feature map memory and the kernel memory both adopt an A/B dual-memory mode and are connected to the data reading module, which conveys their data to the inner product accelerator; the instruction controller is connected to the inner product accelerator; the output controller is connected to the inner product accelerator for data interaction, the output feature map memory is connected to the inner product accelerator, and the data processing result of the inner product accelerator is sent through the output feature map memory to the external memory writing module and finally stored in the external memory;
The memory controller performs data interaction with the deep learning network acceleration module; intermediate data is stored into the external memory through the memory controller and can be read back in from the external memory;
the inner product accelerator is used as a core computing unit of a deep learning network acceleration module and is represented by IPA;
the inner product accelerator comprises 32 processing units in total; each processing unit consists of 16 multiplication units, and each multiplication unit is one DSP unit of an FPGA that receives two groups of data to perform two multiplications respectively, so the IPA is divided into IPA1 and IPA2, into which different data are injected respectively, specifically as follows:
a DSP48E in the FPGA performs a calculation of the form P = (D ± A) × B ± C and is configured as P = (D + A) × B + C; two 16-bit multiplications, P1 = A × K1 and P2 = A × K2, are calculated, and the two multiplications are replaced by one 32-bit multiplication using the substitution P = A × (K1 << 16 + K2); a 32-bit output P is obtained, in which the upper 16 bits of P are P1 and the lower 16 bits are P2; C is configured based on the output result: C = 0 if P2 ≥ 0, otherwise C = 1 << 16.
2. The deep learning processor architecture for smart parking of claim 1, wherein: the parking system uses three deep learning networks, namely a deep learning vehicle recognition network, a deep learning license plate recognition network, and a deep learning license plate number and character recognition network.
3. The deep learning processor architecture for smart parking of claim 2, wherein: the deep learning vehicle recognition network, the deep learning license plate recognition network, and the deep learning license plate number and character recognition network are all decomposed into reusable substructures, and hardware acceleration of the three deep learning networks is completed through combined calls from a special instruction set.
4. The deep learning processor architecture for smart parking of claim 1, wherein: the kernel memory is also capable of storing weight data and bias data.
5. The deep learning processor architecture for smart parking of claim 1, wherein: the high speed data interface module comprises a PCIe or USB3 high speed data communication interface.
CN202010862272.8A 2020-08-25 2020-08-25 Deep learning processor architecture for intelligent parking Active CN111932436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010862272.8A CN111932436B (en) 2020-08-25 2020-08-25 Deep learning processor architecture for intelligent parking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010862272.8A CN111932436B (en) 2020-08-25 2020-08-25 Deep learning processor architecture for intelligent parking

Publications (2)

Publication Number Publication Date
CN111932436A (en) 2020-11-13
CN111932436B (en) 2024-04-19

Family

ID=73305183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010862272.8A Active CN111932436B (en) 2020-08-25 2020-08-25 Deep learning processor architecture for intelligent parking

Country Status (1)

Country Link
CN (1) CN111932436B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336877A (en) * 2013-07-25 2013-10-02 哈尔滨工业大学 Satellite lithium ion battery residual life prediction system and method based on RVM (relevance vector machine) dynamic reconfiguration
CN109948784A (en) * 2019-01-03 2019-06-28 重庆邮电大学 A Convolutional Neural Network Accelerator Circuit Based on Fast Filtering Algorithm
CN110058883A (en) * 2019-03-14 2019-07-26 成都恒创新星科技有限公司 A kind of CNN accelerated method and system based on OPU
CN110058882A (en) * 2019-03-14 2019-07-26 成都恒创新星科技有限公司 It is a kind of for CNN accelerate OPU instruction set define method
CN212302545U (en) * 2020-08-25 2021-01-05 成都恒创新星科技有限公司 Deep learning processor architecture for intelligent parking

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12131250B2 (en) * 2017-09-29 2024-10-29 Intel Corporation Inner product convolutional neural network accelerator

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336877A (en) * 2013-07-25 2013-10-02 哈尔滨工业大学 Satellite lithium ion battery residual life prediction system and method based on RVM (relevance vector machine) dynamic reconfiguration
CN109948784A (en) * 2019-01-03 2019-06-28 重庆邮电大学 A Convolutional Neural Network Accelerator Circuit Based on Fast Filtering Algorithm
CN110058883A (en) * 2019-03-14 2019-07-26 成都恒创新星科技有限公司 A kind of CNN accelerated method and system based on OPU
CN110058882A (en) * 2019-03-14 2019-07-26 成都恒创新星科技有限公司 It is a kind of for CNN accelerate OPU instruction set define method
CN212302545U (en) * 2020-08-25 2021-01-05 成都恒创新星科技有限公司 Deep learning processor architecture for intelligent parking

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Convolutional Neural Network Coprocessor Design Based on Programmable Logic Devices; 杨一晨; 梁峰; 张国和; 何平; 吴斌; 高震霆; Journal of Xi'an Jiaotong University; 2018-07-10 (Issue 07); full text *

Also Published As

Publication number Publication date
CN111932436A (en) 2020-11-13

Similar Documents

Publication Publication Date Title
CN109543832B (en) Computing device and board card
CN109522052B (en) Computing device and board card
CN111860398B (en) Remote sensing image target detection method, system and terminal device
CN104238993B (en) The vector matrix product accelerator of microprocessor integrated circuit
CN108805272A (en) A kind of general convolutional neural networks accelerator based on FPGA
CN109934339A (en) A Universal Convolutional Neural Network Accelerator Based on One-Dimensional Systolic Array
WO2022037257A1 (en) Convolution calculation engine, artificial intelligence chip, and data processing method
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN111814957B (en) Neural network operation method and related equipment
US11775808B2 (en) Neural network computation device and method
CN110059797B (en) Computing device and related product
CN109711540B (en) Computing device and board card
CN110163349B (en) Network model calculation method and device
CN107729944B (en) Identification method and device of popular pictures, server and storage medium
CN111178513B (en) Convolution implementation method and device of neural network and terminal equipment
CN110059809B (en) Computing device and related product
CN110515872B (en) Direct memory access method, device, special computing chip and heterogeneous computing system
CN111161705A (en) Voice conversion method and device
US11256940B1 (en) Method, apparatus and system for gradient updating of image processing model
CN111932436B (en) Deep learning processor architecture for intelligent parking
CN212302545U (en) Deep learning processor architecture for intelligent parking
CN112711051A (en) Flight control system positioning method, device, equipment and storage medium
CN111645687A (en) Lane changing strategy determining method, device and storage medium
CN109799483A (en) A kind of data processing method and device
CN114968182A (en) Operator splitting method, control method and device for storage and computation integrated chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant