CN101739235A - Processor device that seamlessly mixes 32-bit DSP and general-purpose RISC CPU

Info

Publication number: CN101739235A
Application number: CN200810227482A
Authority: CN
Inventors: 梁利平; 王志君
Original assignee: Institute of Microelectronics of CAS
Current assignee: Institute of Microelectronics of CAS
Priority date: 2008-11-26
Filing date: 2008-11-26
Publication date: 2010-06-16

Abstract

The invention discloses a processor device that seamlessly mixes a 32-bit DSP and a general-purpose RISC CPU. On a single hardware platform, the processor device can run dedicated 32-bit DSP instructions and DSP-based SIMD instructions, and can also run general-purpose RISC CPU instructions compatibly through instruction reconstruction. A single DSP instruction supports either one 32-bit data operation or two parallel 16-bit data operations. The processor adopts a 4-channel issue mechanism to increase parallelism, improving both its DSP computation efficiency and its CPU execution and control efficiency. The processor unifies DSP processing and CPU processing and makes them transparent to the user: it inherits the software resources and development tools of the general-purpose RISC CPU, and DSP computation is provided through function libraries, which saves DSP system development resources and simplifies system development for users.

Description

Processor device that seamlessly mixes 32-bit DSP and general-purpose RISC CPU

Technical Field

The present invention relates to the field of processor architecture. More specifically, it is a processor device that seamlessly mixes a 32-bit DSP and a general-purpose RISC CPU. The processor device can run dedicated 32-bit DSP instructions and DSP-based SIMD instructions, and can run general-purpose RISC CPU instructions compatibly through instruction reconstruction.

Background Art

A traditional general-purpose reduced instruction set processor (RISC CPU) operates on registers. As a result, its instruction format is usually simple, its structure is regular, and both software and hardware design are relatively straightforward. On the other hand, a RISC CPU can access memory only through load/store instructions, its addressing modes are limited, and the arithmetic instructions it supports are simple compared with those of a DSP processor. A RISC CPU is therefore well suited to ordinary control applications with small data volumes. As technology advances, however, the demands placed on processors keep rising: applications such as multimedia and communications require fast, high-volume digital signal processing, which the simple structure of a RISC CPU alone can no longer handle. Its limited addressing modes and insufficient computing power become the bottleneck for data processing performance.

The digital signal processor (DSP) was designed to address the shortcomings of RISC CPUs in applications that process large volumes of digital signals. It focuses on data processing capability, especially for digital signal processing tasks such as FFTs and FIR filters. A DSP processor generally contains an independent address generation module (DAG) and can therefore support multiple addressing modes. A DSP processor also differs from a RISC CPU in containing a fast multiply-accumulate module that supports a variety of arithmetic instructions. These two features give the DSP processor flexible and fast data processing, but because a DSP concentrates on digital signal processing, its system control capability is often weaker than that of a RISC CPU.

Traditional processor chip designs generally use either a RISC CPU or a DSP processor alone. To exploit the advantages of both, some solutions attempt to combine them. In one common scheme, the RISC CPU and the DSP processor remain two independent units, with the DSP acting as an accelerator for the general-purpose processor. This structure merely places the two processors side by side: the hardware cost is the sum of the two, and passing data and control signals between them is complicated and requires a set of peripheral circuits, so the system is not easy to use. Other schemes merge the two structures through an entirely new, fully custom instruction set containing both RISC CPU and DSP instructions. Although the hardware can then be optimized in every respect, a new instruction set and hardware structure require the supporting software system to be redeveloped, which increases development cost.

Summary of the Invention

(1) Technical problem to be solved

In view of this, it would be desirable to have a chip architecture that fully fuses a RISC CPU and a DSP processor, based on an existing RISC CPU instruction set and without changing its instruction encoding or execution order. With the RISC general-purpose processor as the main controller, users can continue to develop with the software resources and development tools of the RISC CPU, while DSP development is delivered as function libraries. Using a DSP function then resembles a subroutine call: FFT, FIR, common matrix operations, and similar routines can be written in advance with DSP instructions and packaged into a function library, and when compiled software uses such a function it is called directly. To the user, the processor still appears to be an ordinary RISC CPU, so existing software and hardware resources can be better used to meet various processing needs, which better serves users' applications.

The present invention therefore mainly addresses the problem that, in existing embedded development systems, the RISC CPU and DSP processor structures cannot be effectively integrated, and proposes a processor device that seamlessly mixes a 32-bit DSP and a general-purpose RISC CPU.

(2) Technical solution

To achieve the above object, the present invention provides a processor device that seamlessly mixes a 32-bit DSP and a general-purpose RISC CPU. The processor device consists of an instruction memory module, a data memory module, a load/store unit, a register file module, an instruction fetch module, an instruction pre-decoding and dispatch module, a parallelism control module, a coprocessor module, and four instruction execution channels. It runs dedicated 32-bit DSP instructions and DSP-based SIMD instructions, and can run general-purpose RISC CPU instructions compatibly through instruction reconstruction.

In the above solution, the instruction memory module stores the instruction program. It uses a program ROM/RAM or a cache structure, and when a cache structure is used it supports a multi-way set-associative organization.

In the above solution, the data memory module stores data and intermediate results. It can be configured as a pure D-cache (data cache) structure, or as a D-cache combined with a scratchpad memory. The scratchpad memory provides high-speed data storage for the processor to use when executing DSP instructions. Part of the scratchpad memory is reserved as stack space, serving as the DSP stack that supports stack instructions. The stack is 256 bits wide, and each 256-bit row holds the values of eight 32-bit general-purpose registers, so eight registers can be pushed or popped together in one clock cycle: either eight register values are pushed onto the stack at once, or a row is popped and its data written to the corresponding registers among the eight, as the pop instruction specifies.
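
The 256-bit stack row can be illustrated with a minimal Python sketch. It is only a software model under stated assumptions: the text does not say how the eight registers are ordered within a row, so the sketch assumes register i occupies bits 32i to 32i+31, and the names `WideStack`, `pack_row`, and `unpack_row` are hypothetical.

```python
MASK32 = 0xFFFFFFFF
REGS_PER_ROW = 8  # 8 x 32 bits = 256 bits per stack row, as stated in the text


def pack_row(regs):
    """Pack eight 32-bit register values into one 256-bit row (bit order assumed)."""
    if len(regs) != REGS_PER_ROW:
        raise ValueError("a stack row holds exactly 8 registers")
    row = 0
    for i, value in enumerate(regs):
        row |= (value & MASK32) << (32 * i)
    return row


def unpack_row(row):
    """Split a 256-bit row back into eight 32-bit register values."""
    return [(row >> (32 * i)) & MASK32 for i in range(REGS_PER_ROW)]


class WideStack:
    """Software model of the stack: one push or pop moves a whole row."""

    def __init__(self):
        self.rows = []

    def push(self, regs):
        self.rows.append(pack_row(regs))

    def pop(self):
        return unpack_row(self.rows.pop())
```

In this model one `push` or `pop` call stands for the single-cycle transfer of eight registers described in the text; a narrower 32-bit stack would need eight separate transfers for the same data.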

In the above solution, the load/store unit mainly supports Load/Store instructions when the processor is in RISC CPU mode. It connects the system bus and the data cache.

In the above solution, the register file module is formed by adding 32 DSP-specific 32-bit registers to the register file of an existing RISC general-purpose processor. These 32 registers are divided into 4 groups of data registers and 8 groups of address registers; the data registers and the address registers each occupy 16 of the 32-bit registers.

In the above solution, all data register groups have the same composition. Each group contains four 32-bit registers: two 32-bit registers combined into one 64-bit multiply-accumulate register, one 32-bit multiplier register M, and one 32-bit temporary register T. The multiplier register serves as one operand of multiply instructions, and the temporary register holds intermediate results.

In the above solution, all address register groups have the same composition. Each group contains four 16-bit address registers: a 16-bit address base register (ARB), a 16-bit address register (AR), a 16-bit address offset register (ARO), and a 16-bit circular/global addressing address register (ARC).
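
The register budget stated above can be checked with a few lines of Python. This is only an arithmetic sketch: the field widths come from the text, while the dictionary layout and the names `ACC_H` and `ACC_L` for the two halves of the accumulator are illustrative assumptions.

```python
REG_BITS = 32  # width of each added DSP-specific register

DATA_GROUPS = 4
DATA_GROUP_FIELDS = {"ACC_H": 32, "ACC_L": 32, "M": 32, "T": 32}

ADDR_GROUPS = 8
ADDR_GROUP_FIELDS = {"ARB": 16, "AR": 16, "ARO": 16, "ARC": 16}


def regs_needed(groups, fields):
    """Number of 32-bit registers occupied by `groups` copies of `fields`."""
    total_bits = groups * sum(fields.values())
    if total_bits % REG_BITS:
        raise ValueError("fields do not fill whole 32-bit registers")
    return total_bits // REG_BITS


data_regs = regs_needed(DATA_GROUPS, DATA_GROUP_FIELDS)  # 4 * 128 bits
addr_regs = regs_needed(ADDR_GROUPS, ADDR_GROUP_FIELDS)  # 8 * 64 bits
```

Both halves come to 16 registers, which matches the 32 added registers: each address group of four 16-bit fields fills two 32-bit registers, and each data group fills four.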

In the above solution, for existing RISC CPU instructions the number and function of the visible registers in the register file module are unchanged, and RISC CPU instructions use and modify the registers visible to the CPU. Dedicated DSP instructions can access the entire register file and modify its values.

In the above solution, the register file module also includes a 32-bit dedicated control register that specifies the page addresses of the DSP program and data. This control register is not part of the general-purpose register file. It contains a 14-bit DSP program page address, a 14-bit DSP data page address, and 4 status bits for DSP instructions. The page addresses are used in DSP jump instructions, where they are combined with an address register to form a global access address; a DSP jump can only stay within its page. The DSP itself cannot modify this control register; the general-purpose processor sets and changes it as one of its own coprocessor registers.
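
A possible packing of this control register is sketched below in Python. The field widths (14 + 14 + 4 = 32 bits) come from the text, but the field order is an assumption, as is the rule that a global address is the page address concatenated with a 16-bit address register value; the function names are hypothetical.

```python
PAGE_MASK = 0x3FFF    # 14-bit page address
STATUS_MASK = 0xF     # 4 status bits
OFFSET_MASK = 0xFFFF  # 16-bit address register value


def pack_ctrl(prog_page, data_page, status):
    """Assumed layout: [31:18] program page, [17:4] data page, [3:0] status."""
    if prog_page > PAGE_MASK or data_page > PAGE_MASK or status > STATUS_MASK:
        raise ValueError("field out of range")
    return (prog_page << 18) | (data_page << 4) | status


def unpack_ctrl(reg):
    """Return (program page, data page, status) from the 32-bit register."""
    return (reg >> 18) & PAGE_MASK, (reg >> 4) & PAGE_MASK, reg & STATUS_MASK


def global_address(page, offset16):
    """Assumed rule: page address concatenated with a 16-bit in-page offset."""
    return ((page & PAGE_MASK) << 16) | (offset16 & OFFSET_MASK)
```

Under this assumption a jump target formed from a 16-bit address register can never leave the current page, which is consistent with the statement that DSP jumps stay within the page.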

In the above solution, the instruction fetch module fetches instructions from program memory. The maximum dispatch parallelism supported by the processor is 4. To ensure correct dispatch, the instruction slot is 16 instructions wide, and each time 8 instructions have been dispatched the fetch module fetches 8 new instructions from program memory into the slot.
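
The refill rule for the instruction slot can be modelled roughly as follows. This is a behavioural sketch only: it assumes the slot is filled with 16 instructions at start-up and that a refill is triggered as soon as a running count of dispatched instructions reaches 8, and the class name `FetchBuffer` is hypothetical.

```python
from collections import deque

SLOT_WIDTH = 16  # instruction slot holds 16 instructions
REFILL = 8       # fetch 8 new instructions after every 8 dispatched
MAX_ISSUE = 4    # at most 4 instructions dispatched per cycle


class FetchBuffer:
    def __init__(self, program):
        self.program = list(program)
        self.pc = 0
        self.slot = deque()
        self.dispatched_since_refill = 0
        self._fetch(SLOT_WIDTH)  # initial fill (assumed)

    def _fetch(self, count):
        """Move up to `count` instructions from program memory into the slot."""
        chunk = self.program[self.pc:self.pc + count]
        self.slot.extend(chunk)
        self.pc += len(chunk)

    def dispatch(self, n):
        """Dispatch up to n (at most 4) instructions from the front of the slot."""
        n = min(n, MAX_ISSUE, len(self.slot))
        issued = [self.slot.popleft() for _ in range(n)]
        self.dispatched_since_refill += n
        if self.dispatched_since_refill >= REFILL:
            self.dispatched_since_refill -= REFILL
            self._fetch(REFILL)
        return issued
```

With a slot twice the refill size, at least 8 instructions remain available whenever a refill is pending, which is enough to form a full group of 4 in this model.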

In the above solution, the instruction pre-decoding and dispatch module decodes the first 8 bits of each instruction to identify whether it is a RISC CPU instruction, a DSP instruction, or a SIMD instruction, and dispatches instructions according to the parallelism indicated in them.

In the above solution, the parallelism control module generates a parallelism control signal from the instruction class and the corresponding bit information decoded by the pre-decoding module, and feeds it back to the dispatch module.

In the above solution, the coprocessor controls the processor as a whole, consistently with the control of the existing RISC general-purpose processor, including control of the various status registers, operating modes, and interrupts. The coprocessor also provides a floating-point unit, which further enhances the processor's data processing capability.

In the above solution, the four instruction execution channels inside the processor have the same structure, and each channel can execute any instruction independently. Each channel includes an instruction decoding unit, a program flow control unit, a state control unit, an address generation unit, status registers, and several execution units: a data memory control unit, an arithmetic logic unit, a multiply-accumulate unit, and a shift unit.

In the above solution, the processor device uses a multi-stage pipeline whose main stages are instruction fetch (IA), dispatch (DP), decode (ID), memory access (MEM), transfer (TR), execute (EX), and write-back (WB).

In the above solution, the processor device provides three bypass paths to reduce data hazards: from the EX stage to the ID stage, from the TR stage to the ID stage, and from the EX stage to the TR stage.

In the above solution, the processor device further uses an interlock mechanism to resolve pipeline or data hazards: once a hazard occurs, the interlock automatically stalls the pipeline stages before the EX stage.

In the above solution, the dedicated DSP instructions operate either on register data or directly on memory data. A DSP instruction can perform one 32-bit data operation or two parallel 16-bit data operations.
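
The difference between one 32-bit operation and two parallel 16-bit operations can be shown with addition. The sketch below assumes wrap-around (modular) arithmetic in each lane; the text does not say whether the hardware wraps or saturates, and the function names are hypothetical.

```python
def add32(a, b):
    """One 32-bit addition; a carry out of bit 15 propagates into bit 16."""
    return (a + b) & 0xFFFFFFFF


def add16x2(a, b):
    """Two independent 16-bit additions packed in one 32-bit word.

    The carry out of the low half is discarded instead of reaching the
    high half (wrap-around behaviour is an assumption).
    """
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
    return (hi << 16) | lo
```

For the operands `0x0001FFFF` and `0x00000001`, `add32` gives `0x00020000`, while `add16x2` gives `0x00010000` because the low half wraps to zero without disturbing the high half.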

In the above solution, in the encoding of a dedicated DSP instruction the first 6 bits are the instruction identifier, 2 further bits indicate its parallelism with the preceding and following instructions, and the last 24 bits are the specific DSP instruction. The dedicated DSP instructions mainly comprise arithmetic, multiply-accumulate, logic, shift, conditional jump, and data storage instructions, and provide multiple addressing modes and fast data stack handling.
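
The 6 + 2 + 24 bit split of a DSP instruction word can be sketched as a field decoder. The bit positions are an assumption (identifier in bits 31 to 26, parallelism flags in bits 25 to 24, operation in bits 23 to 0), since the text gives only the field widths, and the meaning of the individual flag values is not modelled. All names are hypothetical.

```python
from typing import NamedTuple


class DspFields(NamedTuple):
    ident: int     # 6-bit instruction identifier
    parallel: int  # 2-bit parallelism flags
    body: int      # 24-bit DSP operation


def encode_dsp(ident, parallel, body):
    """Assemble a 32-bit word from the three fields (assumed bit positions)."""
    if ident > 0x3F or parallel > 0x3 or body > 0xFFFFFF:
        raise ValueError("field out of range")
    return (ident << 26) | (parallel << 24) | body


def decode_dsp(word):
    """Split a 32-bit word into identifier, parallelism flags, and body."""
    word &= 0xFFFFFFFF
    return DspFields((word >> 26) & 0x3F, (word >> 24) & 0x3, word & 0xFFFFFF)
```

Because the identifier and the parallelism flags together occupy the first 8 bits, a pre-decoder that looks at only those 8 bits has everything it needs to classify the instruction and read its flags.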

In the above solution, the first 8 bits of a SIMD instruction encoding are the instruction identifier. The following 24 bits follow the same encoding rules as the last 24 bits of a DSP instruction and specify the same operations. When a SIMD instruction is executed, all four channels operate and perform the same operation, each on its own corresponding registers. SIMD instructions have no parallelism of their own: only one SIMD instruction can be processed per clock cycle.
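
The SIMD behaviour, one operation applied by all four channels to different registers, can be illustrated as follows. The sketch assumes that channel k works on data register group k and uses a hypothetical multiply-accumulate of the form ACC += M * T; the text states only that M is one operand of a multiply, so the choice of T as the second operand is purely illustrative.

```python
LANES = 4  # four instruction execution channels
ACC_MASK = 0xFFFFFFFFFFFFFFFF  # 64-bit accumulator


def make_register_groups():
    """One data register group (ACC, M, T) per channel."""
    return [{"ACC": 0, "M": 0, "T": 0} for _ in range(LANES)]


def simd_mac(groups):
    """Apply the same ACC += M * T step to every channel's own group."""
    for g in groups:
        g["ACC"] = (g["ACC"] + g["M"] * g["T"]) & ACC_MASK


groups = make_register_groups()
for lane, g in enumerate(groups):
    g["M"] = lane + 1  # different data in each channel
    g["T"] = 2
simd_mac(groups)  # one SIMD instruction, four results
```

Since a single SIMD instruction already occupies all four channels, there is nothing left to issue alongside it, which is why only one SIMD instruction is processed per clock cycle.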

In the above solution, the processor device uses a 4-channel parallel instruction issue mechanism and can issue at most 4 instructions per clock cycle.

(3) Beneficial effects

As can be seen from the above technical solution, the present invention has the following beneficial effects:

First, the processor device provided by the present invention seamlessly mixes a DSP and a general-purpose RISC CPU, and inherits the embedded operating systems, compilers, and control protocol stacks that run on the 32-bit RISC CPU instruction set.

Second, the DSP instructions support multiple addressing modes and data operation modes, and algorithms built on DSP instructions are made available simply by extending the compiler's function libraries and link libraries. These libraries let the processor, in the DSP state, efficiently perform the data-intensive computation of signal processing, making DSP computation "transparent" to the user.

Third, the 4-channel issue mechanism of the processor increases the parallelism of program execution and greatly improves the processor's ability to process data-intensive computation.

Description of the Drawings

Fig. 1 is a schematic diagram of the overall structure of the processor device provided by the present invention;

Fig. 2 is a schematic diagram of the pipeline of the processor device provided by the present invention;

Fig. 3 is a schematic diagram of the encoding structure of the DSP instructions and SIMD instructions in the processor device provided by the present invention;

Fig. 4 is a schematic diagram of the allocation of the 32 additional 32-bit DSP-specific registers in the present invention.

Detailed Description

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

The present invention starts from the idea of merging a DSP and a RISC CPU into one. It unifies the architectures of the DSP and the CPU for embedded processing and proposes a processor device that seamlessly mixes a 32-bit DSP and a general-purpose RISC CPU. On a single hardware platform, the processor device can run dedicated 32-bit DSP instructions and DSP-based SIMD (Single Instruction Multiple Data) instructions, and can run general-purpose RISC CPU instructions compatibly through instruction reconstruction. The architecture prototype can, through instruction reconstruction, support mature embedded CPU instruction sets whose IP is openly available under commercial license, and can likewise switch to the DSP processing state through instruction reconstruction. A single DSP instruction supports one 32-bit data operation or two parallel 16-bit data operations. The processor adopts a 4-channel issue mechanism to increase parallelism, improving both its DSP computation efficiency and its CPU execution and control efficiency. The processor unifies DSP processing and CPU processing and makes them transparent, and is directly compatible with the software resources and development tools of mature general-purpose RISC CPUs. DSP computation is provided through function libraries. The processor device saves DSP system development resources and simplifies system development for users.

The hardware structure of the processor is fully fused and can freely execute instructions in any of the instruction formats; there is no separate RISC CPU data path and DSP data path that each handle different instructions. Applications with large data volumes, such as digital signal processing or matrix processing, can therefore be handled with the extended DSP instructions and the SIMD instructions, freeing the RISC general-purpose processor functions from heavy data processing so that they can concentrate on control.

The DSP instruction set of the processor does not operate on registers only; it also provides instructions that operate directly on memory, which saves program space and also greatly increases data processing speed. In addition, the extended DSP instructions support both 32-bit data operations and two parallel 16-bit data operations, which makes data handling more flexible and widens the range of applications. The DSP instructions mainly comprise arithmetic, multiply-accumulate, logic, shift, conditional jump, and data storage instructions, and provide multiple addressing modes and fast data stack handling. To make the most of the respective strengths of the DSP processor and the RISC CPU, the DSP instructions are designed mainly to meet high-speed computation needs, and system control is left to the RISC general-purpose processor instructions; the DSP instruction set therefore contains few instructions for interrupt control and similar functions. This also reduces the total number of instructions in the instruction set and lightens the load on the decoding module.

The processor adopts a 4-channel issue mechanism to increase parallelism. In the DSP instruction encoding, 2 of the first 8 bits indicate the instruction's parallelism with the preceding and following instructions, and the last 24 bits are the specific DSP operation. Thus, even though the original RISC CPU instructions carry no parallelism information, their parallelism can be determined from the parallel flag bits of the adjacent DSP instructions, so that instruction dispatch achieves as much parallelism as possible.

The instruction set of the processor also supports some SIMD instructions. These can be regarded as a special case of DSP instructions and are mainly used for data operations, including ALU, MAC, and load/store operations; they are particularly suited to matrix processing and similar computations. The first 8 bits of a SIMD instruction encoding serve as an identifier that tells the processor the instruction is a SIMD instruction. The following 24 bits follow the same encoding rules as the last 24 bits of a DSP instruction and specify the same operations. When a SIMD instruction is executed, all four channels operate and perform the same operation, each on its own corresponding registers. SIMD instructions have no parallelism of their own: only one SIMD instruction can be processed per clock cycle.

As shown in Fig. 1, a schematic diagram of the overall structure of the processor device provided by the present invention, the processor device consists of an instruction memory module, a data memory module, a load/store unit, a register file module, an instruction fetch module, an instruction pre-decoding and dispatch module, a parallelism control module, a coprocessor module, and four instruction execution channels. It runs dedicated 32-bit DSP instructions and DSP-based SIMD instructions, and can run general-purpose RISC CPU instructions compatibly through instruction reconstruction.

The instruction memory module stores the instruction program and may be a program ROM/RAM or a cache structure (I-cache). The I-cache uses a multi-way set-associative organization and can be configured with different associativities according to its size.

The data memory module stores data and intermediate results. It can be configured as a pure cache structure, or as a cache combined with a scratchpad memory. The pure cache structure mainly serves to match the functions of the existing RISC CPU and avoid compatibility problems. The scratchpad memory provides high-speed data storage for the processor to use when executing DSP instructions. Part of the scratchpad memory is reserved as stack space, serving as the DSP stack that supports stack instructions. The stack is 256 bits wide, and each 256-bit row holds the values of eight 32-bit general-purpose registers, so eight registers can be pushed or popped together in one clock cycle: either eight register values are pushed onto the stack at once, or a row is popped and its data written to the corresponding registers among the eight, as the pop instruction specifies. A stack this wide allows the current register values to be saved in only a few clock cycles when an exception or interrupt occurs, which improves system response speed.

The load/store unit mainly supports Load/Store instructions when the processor is in RISC CPU mode. It connects the system bus and the data cache.

The register file module is formed by adding 32 DSP-specific 32-bit registers (RD0 to RD31) to the register file (R1 to Rn) of an existing RISC general-purpose processor. These registers mainly hold the results of complex operations such as multiply-accumulate, as well as memory access addresses. The 32 DSP-specific registers are divided into 4 groups of data registers and 8 groups of address registers; the data registers and the address registers each occupy 16 of the 32-bit registers. All data register groups have the same composition: each contains four 32-bit registers, namely two 32-bit registers (ah, al) that combine into one 64-bit multiply-accumulate (ACC) register, one 32-bit multiplier register M, and one 32-bit scratch register T. The multiplier register serves as an operand of multiply instructions, and the scratch register holds intermediate results. All address register groups likewise have the same composition: each contains four 16-bit address registers, namely a 16-bit address base (ARB) register, a 16-bit address (AR) register, a 16-bit address offset (ARO) register, and a 16-bit circular/global addressing (ARC) register. The contents of ARB, AR, and ARC are unsigned, while the content of ARO is signed. Depending on the instruction, AR, ARB, ARO, and ARC combine into several addressing modes. This allocation lets each of the 4 channels use its own set of registers during computation, so register usage does not conflict. Together with the execution units, the 32 DSP-specific 32-bit registers form a memory-oriented multiply-accumulate structure that performs DSP computation efficiently.
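As a rough illustration of how one data register group might be used, the following Python sketch models a group as (ah, al, M, T) and performs one multiply-accumulate step. The class and method names are hypothetical. The signed 32 x 32 multiply and the wraparound at 64 bits are assumptions, since the text does not state signedness or overflow behaviour.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


@dataclass
class DataGroup:
    """One of the four data register groups: ACC = (ah, al), plus M and T."""
    ah: int = 0  # upper 32 bits of the 64-bit accumulator
    al: int = 0  # lower 32 bits of the 64-bit accumulator
    m: int = 0   # multiplier register M
    t: int = 0   # scratch register T

    @property
    def acc(self) -> int:
        """The 64-bit accumulator formed from ah and al."""
        return ((self.ah & MASK32) << 32) | (self.al & MASK32)

    @acc.setter
    def acc(self, value: int) -> None:
        value &= MASK64  # assumed: wrap at 64 bits
        self.ah, self.al = value >> 32, value & MASK32

    def mac(self, operand: int) -> None:
        """ACC += M * operand (assumed signed operands)."""
        product = to_signed32(self.m) * to_signed32(operand)
        self.acc = self.acc + product
```

With `g = DataGroup(m=3)`, calling `g.mac(4)` leaves 12 in the accumulator, split across `g.ah` and `g.al`.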

For RISC CPU instructions, the number and function of the registers visible in the register file module are unchanged, so RISC CPU instructions can only use and modify the registers visible to them. DSP instructions can access the entire register file and modify any register value.

In addition, the register file module includes a 32-bit dedicated control register (dspCtrlReg) that specifies the page addresses of the DSP program and data. This register is not part of the general-purpose register file; it contains a 14-bit DSP program page address, a 14-bit DSP data page address, and 4 DSP instruction status bits, which include overflow, carry, and so on. The page address is used in DSP jump instructions, where it is combined with an address register to form a global access address, so a DSP jump can only land within the page. The DSP itself cannot modify the dedicated control register; the general-purpose processor sets and changes it as one of its own coprocessor registers. To avoid errors, the register must be set at system initialization or before a DSP program is used.
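The 14 + 14 + 4 bit split of dspCtrlReg can be sketched in Python as follows. The field order inside the word (program page in the top bits, status in the bottom bits) and the way a page is concatenated with a 16-bit address to form a global address are assumptions made for illustration; the text gives only the field widths. All function names are hypothetical.

```python
PAGE_BITS = 14
STATUS_BITS = 4
PAGE_MASK = (1 << PAGE_BITS) - 1
STATUS_MASK = (1 << STATUS_BITS) - 1


def pack_dsp_ctrl(prog_page: int, data_page: int, status: int) -> int:
    """Pack the three fields into one 32-bit word (assumed field order)."""
    return (((prog_page & PAGE_MASK) << (PAGE_BITS + STATUS_BITS))
            | ((data_page & PAGE_MASK) << STATUS_BITS)
            | (status & STATUS_MASK))


def unpack_dsp_ctrl(word: int) -> tuple[int, int, int]:
    """Return (prog_page, data_page, status) from a 32-bit word."""
    prog_page = (word >> (PAGE_BITS + STATUS_BITS)) & PAGE_MASK
    data_page = (word >> STATUS_BITS) & PAGE_MASK
    return prog_page, data_page, word & STATUS_MASK


def global_address(page: int, addr16: int) -> int:
    """Concatenate a 14-bit page with a 16-bit in-page address (assumption)."""
    return ((page & PAGE_MASK) << 16) | (addr16 & 0xFFFF)
```

Under these assumptions a jump target is `global_address(prog_page, AR)`; because the DSP cannot change `prog_page`, the target always stays inside the current page.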

The instruction fetch module fetches instructions from program memory. The processor supports a maximum instruction dispatch parallelism of 4. To ensure correct dispatch, the instruction slot is 16 instructions wide; each time 8 instructions have been dispatched, the fetch module fetches 8 new instructions from program memory into the instruction slot.
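The refill rule above (a 16-entry slot, topped up by 8 whenever 8 instructions have been dispatched) can be modelled with a small queue. This is a behavioural sketch only: the class name, the initial fill of 16 entries, and the handling of the end of the program are assumptions.

```python
from collections import deque


class InstructionSlot:
    """16-entry instruction slot refilled in blocks of 8 (hypothetical model)."""
    CAPACITY = 16
    REFILL = 8

    def __init__(self, program):
        self.program = iter(program)
        self.slot = deque()
        self.since_refill = 0
        self._fetch(self.CAPACITY)  # assumed: slot starts full

    def _fetch(self, count):
        """Append up to `count` instructions from program memory."""
        for _ in range(count):
            instr = next(self.program, None)
            if instr is None:  # program exhausted
                break
            self.slot.append(instr)

    def dispatch(self, count):
        """Remove up to `count` (at most 4) instructions from the slot head."""
        count = min(count, 4, len(self.slot))
        issued = [self.slot.popleft() for _ in range(count)]
        self.since_refill += count
        if self.since_refill >= self.REFILL:
            self.since_refill -= self.REFILL
            self._fetch(self.REFILL)
        return issued
```

Because at most 4 instructions leave per cycle and a refill is triggered as soon as 8 have left, the slot never holds more than 16 entries in this model.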

The instruction pre-decoding and dispatch module decodes the first 8 bits of each instruction to identify whether it is a RISC CPU instruction, a DSP instruction, or a SIMD instruction, and dispatches instructions according to the parallelism encoded in them.

The parallelism control module generates a parallelism control signal from the instruction class and the corresponding bit information decoded by the pre-decoding module, and feeds this signal back to the dispatch module.

The coprocessor controls the processor as a whole, in a manner consistent with the control of the existing RISC general-purpose processor, including control of the various status registers, the operating mode, and interrupts. The coprocessor also provides a floating-point unit, which further enhances the data processing capability of the processor.

The present invention uses a 4-issue structure to increase the parallelism of instruction execution. The processor contains 4 instruction execution channels that are fully homogeneous: each channel can independently execute every instruction, so the issue order is not constrained by channel number. Each execution channel mainly comprises an instruction decoding unit, a program flow control unit, a state control unit, an address generation (DAG) unit, a status register, and several execution units. The execution units are mainly a storage control unit, an ALU (arithmetic logic) unit, a MAC (multiply-accumulate) unit, and an SHF (shift) unit. The instruction decoding unit decodes the specific instruction type and the operands required. The program flow control unit handles the order of program execution and is mainly concerned with jump instructions. The address generation unit is mainly used by DSP instructions; it generates the memory access address and performs address modification. The status register and the state control unit record the current states of the processor and supply the information needed for the decisions made during instruction execution.

The processor device uses a multi-stage pipeline whose depth can be adapted to application requirements. The main pipeline functions are instruction fetch (IA), dispatch (DP), decode (ID), memory access (MEM), transfer (TR), execute (EX), and write-back (WB). Instruction execution proceeds roughly as follows: the IA stage fetches instructions from instruction memory; the DP stage pre-decodes them, generates the corresponding parallelism information, and dispatches them; instructions dispatched to each channel are fully decoded in the ID stage; the MEM stage accesses data memory to read or write memory data; data read in MEM is stored in the data scratch register in the TR stage, where operation of logic units such as the ALU and MAC also begins; the EX stage executes the instruction; finally, the WB stage writes the result to the corresponding register.

The processor provides bypass logic to reduce data conflicts. The pipeline has 3 bypass paths: one from the EX stage to the ID stage, one from the TR stage to the ID stage, and one from the EX stage to the TR stage. The processor also provides an interlock mechanism to resolve pipeline or data conflicts: once a conflict occurs, the interlock automatically stalls the pipeline stages before the EX stage.
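One way to picture the interlock is as a single clock step of a seven-entry pipeline. In the Python sketch below, a stall freezes the stages before EX, lets EX drain into WB, and inserts a bubble into EX. That drain-and-bubble behaviour is an assumed interpretation of "stalls the stages before EX", not a detail given in the text, and the function name is hypothetical.

```python
STAGES = ["IA", "DP", "ID", "MEM", "TR", "EX", "WB"]
EX = STAGES.index("EX")


def step(pipeline, next_instr, stall):
    """Advance a 7-entry pipeline by one clock and return the new state.

    pipeline[i] holds the instruction in STAGES[i], or None for a bubble.
    """
    if not stall:
        # Normal operation: every instruction moves one stage forward.
        return [next_instr] + pipeline[:-1]
    # Interlock: IA..TR keep their contents, EX drains into WB,
    # and a bubble enters EX (assumed behaviour).
    return pipeline[:EX] + [None, pipeline[EX]]
```

In the stalled case `next_instr` is simply not consumed, so the fetch side would present it again on the following clock.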

DSP instructions operate on register data or directly on memory data, and support either one 32-bit data operation or two parallel 16-bit data operations. In the dedicated DSP instruction encoding, the first 6 bits are the instruction identifier, 2 further bits mark the instruction's parallelism with the preceding and following instructions, and the last 24 bits are the specific DSP instruction. The dedicated DSP instructions mainly comprise arithmetic, multiply-accumulate, logic, shift, conditional jump, and data storage instructions, and provide multiple addressing modes and fast data stack handling. In the SIMD instruction encoding, the first 8 bits are the instruction identifier, and the following 24 bits follow the same encoding rules as the last 24 bits of a DSP instruction and specify the same operations. When a SIMD instruction is used, all 4 channels work and perform the same operation, each on its own corresponding registers. SIMD instructions carry no parallelism of their own: only 1 SIMD instruction can be processed per clock. The processor device uses a 4-channel parallel instruction issue mechanism and issues at most 4 instructions per clock cycle.
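The difference between one 32-bit operation and two parallel 16-bit operations can be shown with addition. The sketch below is illustrative only: the function names are hypothetical, and wraparound on overflow is an assumption (the text does not say whether results wrap or saturate).

```python
def add32(a: int, b: int) -> int:
    """One 32-bit addition, wrapping on overflow."""
    return (a + b) & 0xFFFFFFFF


def add16x2(a: int, b: int) -> int:
    """Two independent 16-bit additions packed in one 32-bit word."""
    low = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    high = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
    return (high << 16) | low
```

For the operands `0x0001FFFF` and `0x00000001`, `add32` carries out of the low half and gives `0x00020000`, whereas `add16x2` keeps the halves separate and gives `0x00010000`.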

Referring again to FIG. 1, which is a schematic diagram of the overall structure of the processor device provided by the present invention, the 4 channels share the same register file and memory space. After the fetch unit fetches instructions from the I-cache, the pre-decoding and dispatch unit determines the instruction type and sends it to the parallelism control module; the parallelism control module determines the parallelism from the instruction signals, and the dispatch unit dispatches the instructions. If the parallelism is 1, one instruction is placed in channel 1; if it is 2, two instructions are placed in channels 1 and 2; if it is 3, three instructions are placed in channels 1, 2, and 3. Each channel carries out its own task according to its instruction. If any one channel has a conflict that requires the pipeline to stall temporarily, all channels stall together.
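The dispatch rule can be sketched as grouping an instruction stream into per-cycle issue packets, where position k in a packet corresponds to channel k + 1. The sketch assumes that each instruction carries a flag meaning "parallel with the next instruction" and that a packet is cut at 4 instructions; the function name and the input representation are hypothetical, and SIMD instructions are not modelled.

```python
def group_for_issue(instrs, max_lanes=4):
    """Split [(name, parallel_with_next), ...] into per-cycle issue packets."""
    packets, current = [], []
    for name, parallel_with_next in instrs:
        current.append(name)
        # Close the packet when the chain ends or all channels are used.
        if not parallel_with_next or len(current) == max_lanes:
            packets.append(current)
            current = []
    if current:
        packets.append(current)
    return packets
```

A packet of length 3, for example, means the three instructions go to channels 1, 2, and 3 in the same clock cycle.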

FIG. 2 is a schematic diagram of the pipeline of the processor device provided by the present invention, in which:

IA: reads instruction data from the I-cache (4 instruction words per read) and sends the instructions to the instruction slot;

DP: pre-decodes the first 8 opcode bits of the instructions in the instruction slot to determine their type and parallelism, then dispatches the instructions from the slot to the channels according to that parallelism;

DC (the decode stage, called ID above): decodes the instruction. If memory data must be accessed, this stage reads the address register and derives the address of the memory or register to be accessed, together with the corresponding read/write and control signals; post-modification of the address register AR is also performed in this stage;

MEM: in this clock cycle, memory data is read or written according to the instruction type and address;

TR: for data transfer instructions, performs the register write; all loads into the register file complete in this stage. Data read from memory is buffered, and data from the scratch register and from the register file is fed into the shift and Booth-encoding reduction unit. Some ALU instructions of the RISC general-purpose processor are executed in this stage;

EX: stores the intermediate results of the shift and Booth-encoding computation in the intermediate result registers, and performs the 64-bit adder-array reduction and other ALU operations;

WB: updates the relevant registers.
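The TR and EX stages mention Booth encoding followed by an adder-array reduction. The text does not state which Booth radix is used, so the following Python sketch assumes the common radix-4 recoding purely for illustration: the multiplier is recoded into digits in {-2, -1, 0, 1, 2}, and the product is the sum of the correspondingly shifted partial products. Function names are hypothetical.

```python
# Radix-4 Booth digit for each 3-bit window (b[i+1], b[i], b[i-1]).
BOOTH_DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_digits(multiplier: int, bits: int = 32) -> list[int]:
    """Recode a two's-complement multiplier into bits/2 digits in {-2..2}."""
    # Shift left by one to append the implicit b[-1] = 0 below the LSB.
    extended = (multiplier & ((1 << bits) - 1)) << 1
    return [BOOTH_DIGIT[(extended >> i) & 0b111] for i in range(0, bits, 2)]


def booth_multiply(multiplicand: int, multiplier: int, bits: int = 32) -> int:
    """Sum the shifted partial products selected by the Booth digits."""
    return sum((digit * multiplicand) << (2 * k)
               for k, digit in enumerate(booth_digits(multiplier, bits)))
```

In hardware the 16 partial products of a 32-bit multiply would be reduced by an adder array rather than a sequential `sum`; the sketch only shows that the recoding preserves the product.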

FIG. 3 is a schematic diagram of the encoding of DSP instructions and SIMD instructions in the processor device provided by the present invention. The DSP instructions mainly comprise ALU (arithmetic logic) instructions, MAC (multiply-accumulate) instructions, conditional jump instructions, and data storage instructions. In a DSP extension instruction, the first 6 bits are the instruction identification code, and the next 2 bits indicate the parallel relationship between this instruction and the instructions before and after it, respectively: 1 means parallel, 0 means not parallel. The last 24 bits are the specific DSP instruction, in which aaa selects one of the 8 address register groups used by the operation; iiiiiiii is an immediate value; dd selects one of the 4 data register groups used by the operation; ssssss is the shift control code, supporting a left shift of up to 31 bits and a right shift of up to 32 bits; m indicates whether the operation is 16-bit or 32-bit; and xxxxxxx is the addressing mode, covering register addressing and memory addressing. For memory addressing, the modes further include address increment, address decrement, address plus offset, reversed address, and linked address. In a SIMD instruction, the first 8 bits identify the instruction type, and the last 24 bits follow the same encoding rules as the last 24 bits of a DSP instruction. The SIMD encoding has no bits indicating parallelism: 1 SIMD instruction is issued per clock cycle, and the 4 channels perform the same operation.
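The top-level split of a DSP instruction word (6 + 2 + 24 bits) can be expressed as a few shifts and masks. In this sketch, "first" is taken to mean most significant, and the order of the two parallelism bits (previous, then next) is assumed; neither is confirmed by the text. Decoding ssssss as a 6-bit two's-complement value is likewise an assumption, chosen because it yields exactly the stated range of 31 left and 32 right. Function names are hypothetical.

```python
def split_dsp_word(word: int) -> dict:
    """Split a 32-bit DSP instruction into its top-level fields."""
    return {
        "ident": (word >> 26) & 0x3F,    # first 6 bits: instruction identifier
        "par_prev": (word >> 25) & 0x1,  # assumed: parallel with previous
        "par_next": (word >> 24) & 0x1,  # assumed: parallel with next
        "body": word & 0xFFFFFF,         # last 24 bits: specific DSP instruction
    }


def shift_amount(ssssss: int) -> int:
    """Decode the 6-bit shift field as two's complement (assumption).

    Positive values mean a left shift (up to 31),
    negative values a right shift (up to 32).
    """
    ssssss &= 0x3F
    return ssssss - 64 if ssssss & 0x20 else ssssss
```

The positions of aaa, dd, m, and the other sub-fields inside the 24-bit body depend on the instruction class shown in FIG. 3 and are not decoded here.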

FIG. 4 is a schematic diagram of the allocation of the 32 extended 32-bit DSP-specific registers in the present invention. In register-number order, the first 8 registers form four 64-bit multiply-accumulate (ACC) registers; the next four are 32-bit multiplier registers (M), which hold the multipliers used by multiply instructions; the following four are 32-bit scratch registers (T), which can be used for temporary data storage; and the remaining sixteen 32-bit registers form the 8 groups of address registers.
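The numbering in FIG. 4 can be written as a lookup from register index to role. The sketch below follows the stated order (8 ACC halves, 4 M, 4 T, 16 address registers). The pairing of consecutive registers into one ACC, the ah-before-al order, and the mapping of two consecutive 32-bit registers to one address group (four 16-bit registers are 64 bits) are assumptions, as are the function name and the label format.

```python
def describe_rd(index: int) -> str:
    """Return the role of DSP register RD<index> under the FIG. 4 ordering."""
    if not 0 <= index < 32:
        raise ValueError("RD index must be in 0..31")
    if index < 8:
        half = "ah" if index % 2 == 0 else "al"  # assumed half order
        return f"ACC{index // 2}.{half}"
    if index < 12:
        return f"M{index - 8}"
    if index < 16:
        return f"T{index - 12}"
    # 16 registers hold 8 address groups: two 32-bit registers per group.
    return f"ADDR{(index - 16) // 2}.word{(index - 16) % 2}"
```

The totals are consistent with the description: 4 groups x 4 registers = 16 data registers, and 8 groups x 4 x 16 bits = 512 bits = 16 address registers of 32 bits.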

The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention in detail. It should be understood that they are merely specific embodiments of the present invention and do not limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A processor device that seamlessly mixes a 32-bit DSP and a general-purpose RISC CPU, characterized by comprising an instruction memory module, a data memory module, a load/store unit module, a register file module, an instruction fetch module, an instruction pre-decoding and dispatch module, a parallelism control module, a coprocessor module, and 4 instruction execution channels, wherein the instruction execution channels run dedicated 32-bit DSP instructions and SIMD instructions based on DSP processing, and compatibly run general-purpose RISC CPU instructions through instruction reconstruction.

2. The processor device of claim 1, wherein the instruction memory module stores the instruction program and uses either a program ROM/RAM structure or a cache structure, and supports a multi-way set-associative structure when a cache structure is used.

3. The processor device of claim 1, wherein the data memory module stores data and intermediate computation results, and can be configured either entirely as a data cache (D-cache) or as a combination of a D-cache and a scratch pad memory.

4. The processor device of claim 3, wherein the scratch pad memory provides high-speed data storage for the processor to use when executing DSP instructions; part of the scratch pad memory is reserved as stack space for handling the DSP stack and supporting stack instructions; the stack is 256 bits wide, and each 256-bit line stores the values of 8 32-bit general-purpose registers, so that 8 registers can be pushed or popped simultaneously in one clock cycle: the data of 8 registers is pushed together, or stack data is popped and stored into the corresponding registers among the 8, as the pop instruction requires.

5. The processor device of claim 1, wherein the register file module is formed by adding 32 DSP-specific 32-bit registers to the register file of an existing RISC general-purpose processor; the 32 DSP-specific 32-bit registers are divided into 4 groups of data registers and 8 groups of address registers, and the data registers and the address registers each consist of 16 32-bit registers.

6. The processor device of claim 5, wherein every group of data registers has the same composition and comprises four 32-bit registers: two 32-bit registers combined to form one 64-bit multiply-accumulate register, one 32-bit multiplier register M, and one 32-bit scratch register T; the multiplier register is used as an operand in multiply instructions, and the scratch register is used to store intermediate result data.

7. The processor device of claim 5, wherein every group of address registers has the same composition, each group comprising four 16-bit address registers, including a 16-bit address base register, a 16-bit address offset register, and a 16-bit coefficient address register.

8. The processor device of claim 5, wherein, for existing RISC CPU instructions, the number and functions of the visible registers of the register file module are unchanged, and those instructions use and modify the CPU-visible registers; for dedicated DSP instructions, the entire register file can be accessed and its values modified.

9. The processor device of claim 5, wherein the register file module further comprises a 32-bit dedicated control register for specifying the page addresses of the DSP program and data; the dedicated control register is not in the general-purpose register file and comprises a 14-bit DSP program page address, a 14-bit DSP data page address, and 4 DSP instruction status bits; the page address in the dedicated control register is used in DSP jump instructions to form a global access address together with an address register, so that a DSP jump can only be made within the page range; the dedicated control register cannot be modified by the DSP itself and is set and changed by the general-purpose processor as one of its own coprocessor registers.

10. The processor device of claim 1, wherein the instruction fetch module fetches instructions from program memory; the maximum instruction dispatch parallelism supported by the processor is 4; to ensure correct instruction dispatch, the instruction slot is 16 instructions wide; and each time 8 instructions have been dispatched, the instruction fetch module fetches 8 new instructions from program memory into the instruction slot.

11. The processor device of claim 1, wherein the instruction pre-decoding and dispatch module identifies whether an instruction is a RISC CPU instruction, a DSP instruction, or a SIMD instruction by decoding the first 8 bits of the instruction, and dispatches instructions according to the parallelism encoded in them.

12. The processor device of claim 1, wherein the parallelism control module generates a parallelism control signal from the instruction class and the corresponding bit information decoded by the pre-decoding module, and feeds the signal back to the dispatch module.

13. The processor device of claim 1, wherein the coprocessor controls the processor as a whole, consistently with the control of the existing RISC general-purpose processor, including control of the various status registers, the operating mode, and interrupts; a floating-point unit is also provided in the coprocessor, further enhancing the data processing capability of the processor.

14. The processor device of claim 1, wherein the 4 instruction execution channels within the processor are identical in structure, each channel being capable of independently executing all instructions; each channel comprises an instruction decoding unit, a program flow control unit, a state control unit, an address generation unit, a status register, and a plurality of execution units; the execution units comprise a data storage control unit, an arithmetic logic unit, a multiply-accumulate unit, and a shift unit.

15. The processor device of claim 1, wherein the processor device employs a multi-stage pipeline structure, the pipeline mainly comprising instruction fetch IA, dispatch DP, decode ID, memory access MEM, transfer TR, execute EX, and write-back WB.

16. The processor device of claim 15, wherein the processor device internally provides 3 bypass logic paths for reducing data conflicts, namely: a bypass path from the EX stage to the ID stage, a bypass path from the TR stage to the ID stage, and a bypass path from the EX stage to the TR stage.

17. The processor device of claim 16, wherein the processor device further employs an interlock mechanism to resolve pipeline or data conflicts, the interlock mechanism automatically stalling the pipeline stages before the EX stage once a conflict occurs.

18. The processor device of claim 1, wherein the dedicated DSP instructions operate on register data or directly on memory data; the DSP instructions support 32-bit data operations or two parallel 16-bit data operations.

19. The processor device of claim 1, wherein, in the dedicated DSP instruction encoding, the first 6 bits are the instruction identifier, 2 further bits mark the instruction's parallelism with the preceding and following instructions, and the last 24 bits are the specific DSP instruction; the dedicated DSP instructions mainly comprise arithmetic, multiply-accumulate, logic, shift, conditional jump, and data storage instructions, and provide multiple addressing modes and fast data stack handling.

20. The processor device of claim 1, wherein the first 8 bits of the SIMD instruction encoding are the instruction identifier, and the following 24 bits follow the same encoding rules as the last 24 bits of a DSP instruction and specify the same operations; when a SIMD instruction is used, all 4 channels work and perform the same operation, each on its own corresponding registers; SIMD instructions carry no parallelism of their own, and only 1 SIMD instruction can be processed per clock.

21. The processor device of claim 1, wherein the processor device employs a 4-channel parallel instruction issue mechanism and issues at most 4 instructions in one clock cycle.