CN114020656B - Non-blocking L1 Cache in Multi-core SOC - Google Patents
Non-blocking L1 Cache in Multi-core SOC
- Publication number
- CN114020656B CN114020656B CN202111305639.7A CN202111305639A CN114020656B CN 114020656 B CN114020656 B CN 114020656B CN 202111305639 A CN202111305639 A CN 202111305639A CN 114020656 B CN114020656 B CN 114020656B
- Authority
- CN
- China
- Prior art keywords
- data
- cache
- cpu
- dirty
- needs
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0877—Cache access modes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0893—Caches characterised by their organisation or structure
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention relates to a non-blocking L1 Cache in a multi-core SOC. The L1 Cache comprises two circuit modules, ICache and DCache: the ICache module is designed as an 8-way set-associative dual-bank SRAM providing data access for the CPU instruction fetch stage, and the DCache module is designed as an 8-way set-associative single-bank SRAM providing data access for the CPU LSU stage. The design covers the control circuit and pipeline in the ICache, and the control circuit, pipeline, replacement algorithm, miss status holding register (MSHR) queue and coherence protocol in the DCache. The non-blocking L1 Cache in the multi-core SOC solves problems such as low Cache access efficiency and design difficulty in existing multi-core SOCs.
Description
Technical Field
The invention relates to the technical field of computer architecture, in particular to a non-blocking L1 Cache in a multi-core SOC.
Background
With the development of semiconductor technology, memory performance has fallen far behind processor performance. From the early 1980s to the early 2000s, processor performance grew at roughly 50% per year while DRAM performance improved by only about 10% per year; the speed and bandwidth of the processor and DRAM are thus severely mismatched, and frequent DRAM accesses occupy a large amount of bus bandwidth. The locality principle of computer programs states that if a particular memory location is accessed at some point, the same location is likely to be accessed again in the near future, and nearby locations are also likely to be accessed soon. Therefore, general-purpose CPUs commonly use a small, fast cache to hold data with locality and reduce processor accesses to DRAM; when running programs with good locality, more than 90% of the data the processor needs can hit in the cache, greatly reducing the number of DRAM accesses. As the cache that interacts directly with a processor core, the L1 Cache's access efficiency directly affects the CPU's clock frequency, and L1 Cache design has always been a focus and hot spot in the field of computer architecture.
Disclosure of Invention
The technical problem addressed by the invention is therefore the severe mismatch between the speed and bandwidth of the processor and DRAM in the prior art, which it solves by providing a non-blocking L1 Cache in a multi-core SOC based on the RISC-V instruction set.
To solve the above technical problems, the present invention provides a non-blocking L1 Cache in a multi-core SOC. The L1 Cache includes two circuit modules, an ICache and a DCache: the ICache module is designed as an 8-way set-associative dual-bank SRAM providing data access for the CPU instruction fetch stage, and the DCache module is designed as an 8-way set-associative single-bank SRAM providing data access for the CPU LSU stage. The method further includes:
When data required by a core CPU of the SOC processor misses in the Cache, the CPU is not blocked: the instruction whose data is not yet ready is placed in a miss handling queue, the next instruction continues to execute, and after the L1 Cache fetches the data back from lower-level memory, the instruction is re-executed;
when a CPU of the SOC processor modifies data and writes it back into the L1 Cache, the data in the L1 Cache is marked as dirty. When other CPUs in the SOC request access to that data, their local L1 Caches do not hold the latest copy, because the data has only been updated in the local L1 Cache of the writing CPU, so the data must be synchronized using a coherence algorithm.
In one embodiment of the invention, recording the miss operation in the Cache miss queue and notifying the CPU to re-execute the corresponding instruction after the data is fetched back into the L1 Cache comprises:
Recording the address of the missed instruction into the miss queue and notifying the CPU to continue executing subsequent instructions;
Simultaneously sending the address of the missing data to the bus and notifying the lower-level storage unit to return the corresponding data;
Fetching the required data back into the SRAM of the L1 Cache through the bus, fetching the previously stored missed-instruction address from the miss queue, and notifying the CPU to execute the instruction again (a software sketch of this miss-queue flow follows below).
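For illustration only, the miss-queue (MSHR) flow above can be modeled in C as follows. The queue depth and the helper names bus_request and cpu_replay are assumptions made for this sketch, not elements of the claimed circuit:

```c
#include <stdbool.h>
#include <stdint.h>

#define MSHR_DEPTH 8  /* assumed queue depth, chosen only for illustration */

extern void bus_request(uint32_t data_addr);  /* hypothetical: ask lower-level memory */
extern void cpu_replay(uint32_t inst_addr);   /* hypothetical: tell the CPU to re-execute */

typedef struct {
    uint32_t inst_addr;  /* address of the instruction whose data missed */
    uint32_t data_addr;  /* address of the missing data, sent to the bus */
    bool     valid;
} mshr_entry;

static mshr_entry mshr[MSHR_DEPTH];

/* On a miss: record the instruction, issue the bus request, let the CPU run on. */
static bool mshr_alloc(uint32_t inst_addr, uint32_t data_addr) {
    for (int i = 0; i < MSHR_DEPTH; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (mshr_entry){inst_addr, data_addr, true};
            bus_request(data_addr);
            return true;   /* CPU continues with independent instructions */
        }
    }
    return false;          /* queue full: the pipeline would have to stall */
}

/* On refill: data is back in the Cache, so replay every instruction waiting on it. */
static void mshr_refill(uint32_t data_addr) {
    for (int i = 0; i < MSHR_DEPTH; i++) {
        if (mshr[i].valid && mshr[i].data_addr == data_addr) {
            cpu_replay(mshr[i].inst_addr);
            mshr[i].valid = false;
        }
    }
}
```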
In one embodiment of the present invention, in maintaining data consistency among multiple cores, data carries different permissions on different cores. The coherence protocol defines 4 permission levels: no local copy; a local copy with read permission only; a local copy with read and write permission; and dirty data. The method specifically includes:
When data requested from a remote end is marked as dirty, the dirty data must first be written back to lower-level memory while the data is returned to the local cache, so that both the remote end and the local end hold the latest copy;
When the remote end's data is marked as dirty, the copy is first written back and simultaneously returned to the requesting end; at this point the remote copy's permission becomes Branch or Trunk, and when the requesting end writes the data, the local data is marked dirty while the remote copy is invalidated;
when data needs to be written, it is judged whether the permission of the local copy is sufficient: if sufficient, the data is written directly; otherwise the data permission is upgraded first, invalidating the other cores' copies and setting the local permission to Dirty, and then the data is written (a software model of these permission rules is sketched below).
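As a rough software model, the four permission levels and the rules above can be captured in C. The enum names mirror Nothing/Branch/Trunk/Dirty from the description, while the three extern helpers are hypothetical stand-ins for bus transactions:

```c
typedef enum { NOTHING, BRANCH, TRUNK, DIRTY } coh_state;
/* NOTHING: no local copy; BRANCH: read-only copy;
   TRUNK: readable and writable copy; DIRTY: locally modified copy. */

extern void invalidate_remote(unsigned line);            /* hypothetical helpers */
extern void writeback_to_lower(unsigned line);
extern void fetch_with_write_permission(unsigned line);

/* Local write: upgrade permission if needed, invalidate other copies, mark Dirty. */
static void local_write(coh_state *st, unsigned line) {
    if (*st == NOTHING || *st == BRANCH) {
        fetch_with_write_permission(line);  /* acquire a writable copy via the bus */
        invalidate_remote(line);            /* other cores' copies must be dropped */
    }
    *st = DIRTY;                            /* local copy is now the unique latest one */
}

/* Remote read request that hits local Dirty data: write back first, then share. */
static coh_state serve_remote_read(coh_state *st, unsigned line) {
    if (*st == DIRTY) {
        writeback_to_lower(line);  /* lower-level memory receives the latest copy */
        *st = BRANCH;              /* both ends keep a read-only (Branch) copy */
    }
    return *st;
}
```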
In one embodiment of the present invention, there are two ways to write back Dirty data: one is the data write-back during the consistency maintenance of claim 3, and the other is the CPU actively writing back the Dirty data, including:
When consistency maintenance involves dirty data, the data is written back to the next level of memory while consistency maintenance is re-performed;
When too much dirty data has accumulated locally, the CPU can actively write back all dirty data in a single pipelined pass.
In one embodiment of the present invention, a separate cache ICache is designed to provide instructions for the CPU instruction fetch stage, including:
when the instruction fetch (IF) stage of the CPU pipeline needs the instruction corresponding to the program counter PC, the instruction at the corresponding address is fetched from the ICache and returned to the CPU;
The ICache only needs to provide data access for the CPU instruction fetch stage; the CPU never writes data back to the ICache, all ICache data comes from lower-level memory written in through the bus, and no data becomes dirty through CPU writes, so the ICache needs no consistency maintenance or support for data write-back.
In one embodiment of the present invention, a separate cache DCache is designed to provide data access and caching for the LSU stage of the CPU, including:
When a load instruction executes, the data at the requested address must be provided to the CPU; when a store instruction executes, the corresponding data must be cached at the corresponding address, and the permission of that data in the DCache is set to Dirty;
Since the LSU module accesses the DCache and writes data, data in the DCache can become dirty, so consistency-maintenance and write-back modules must be added to the DCache. To improve CPU performance, when data the LSU wants to access is not in the DCache, the CPU does not enter a wait state on the miss but instead executes other instructions unrelated to the missed one, so a miss handling mechanism must also be added.
Compared with the prior art, the technical scheme of the invention has the advantage that the DCache in the non-blocking L1 Cache not only provides data access but also caches data. A processor CPU can write modified data into the Cache, at which point the caches of other processor CPUs may hold stale copies of the data that still need to be accessed; the coherence protocol and a write-back policy then deliver the data to the target CPU. Meanwhile, after instruction reordering, dependences between instructions can be determined, and when the data address to be accessed misses in the Cache, subsequent independent instructions can continue to execute; the DCache is therefore designed to be non-blocking.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is a block diagram of the data_array of the ICache according to an embodiment of the present invention.
FIG. 2 is a diagram of the ICache architecture according to one embodiment of the present invention.
FIG. 3 is a block diagram of the data_array of the DCache according to one embodiment of the present invention.
FIG. 4 is a diagram of the DCache architecture according to one embodiment of the present invention.
FIG. 5 is a state machine diagram of the writeback module according to one embodiment of the invention.
FIG. 6 is a state machine diagram of the prober module according to one embodiment of the present invention.
Detailed Description
This embodiment provides a non-blocking L1 Cache in a multi-core SOC. The L1 Cache includes two circuit modules, an ICache and a DCache. The ICache circuit module is designed as an 8-way set-associative dual-bank SRAM providing data access for the CPU instruction fetch stage, which specifically includes the following steps:
Receiving a request from the instruction fetch stage of the CPU pipeline;
Splitting the requested address: bits 11-6 form the index idx and bits 31-12 form the tag, where idx indexes the corresponding set and tag is compared to check for a match;
Reading out the corresponding valid bits and tags via idx; if the valid bit is valid, comparing the 8 stored tags with the tag requested by the CPU; on a match a hit is judged and the corresponding data is returned, otherwise a miss is reported;
On a miss of the CPU request, requesting the missing data from the bus;
When the data returns from the bus, filling it into the corresponding ICache storage unit and setting the valid bit to 1;
During the miss period the CPU keeps re-issuing the request for the instruction; after the data has been filled, the CPU's request hits and the next instruction executes (the lookup flow is sketched in code below).
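A minimal C model of this lookup, assuming the 8-way, 64-set, 64 B-block organization described (bits 11:6 as idx, bits 31:12 as tag); the function names are illustrative, not taken from the design:

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 8
#define SETS 64   /* idx is address bits 11:6, i.e. 6 bits -> 64 sets */

static uint32_t tag_array[SETS][WAYS];  /* 20-bit tags */
static bool     vb_array[SETS][WAYS];   /* valid bits */

static inline uint32_t idx_of(uint32_t addr) { return (addr >> 6)  & 0x3F;    }
static inline uint32_t tag_of(uint32_t addr) { return (addr >> 12) & 0xFFFFF; }

/* Returns the hit way, or -1 on a miss. */
static int icache_lookup(uint32_t addr) {
    uint32_t idx = idx_of(addr), tag = tag_of(addr);
    for (int w = 0; w < WAYS; w++) {
        if (vb_array[idx][w] && tag_array[idx][w] == tag)
            return w;   /* hit: the corresponding data is returned to the CPU */
    }
    return -1;          /* miss: the block must be requested from the bus */
}
```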
The DCache circuit module is designed as an 8-way set-associative single-bank SRAM providing data access for the CPU LSU stage, which specifically includes the following steps:
Receiving a load/store access request from the CPU;
Splitting the requested address: bits 11-6 form the index idx and bits 31-12 form the tag; idx indexes the corresponding set and tag is compared to check for a match.
Reading out the corresponding permission meta and tags via idx; if the permission matches the operation the CPU requires, comparing the 8 tags with the tag requested by the CPU; on a match, performing the operation the CPU requires and returning the corresponding data, otherwise reporting a miss.
Actively or passively writing back data whose permission is Dirty;
When data requested by a remote end is locally dirty, performing consistency maintenance;
When the data currently required misses locally, storing the instruction corresponding to the missed data into the miss queue, requesting the missing data from the bus, and notifying the CPU to re-execute the instruction after the data returns from lower-level storage;
Handling atomic memory-access instructions with the internal LR/SC consistency model.
The method additionally comprises the steps of:
When data required by a core CPU of the SOC processor misses in the Cache, the CPU is not blocked: the instruction whose data is not yet ready is placed in the miss handling queue, the next instruction continues to execute, and after the L1 Cache fetches the data back from lower-level memory, the instruction is re-executed;
when a CPU of the SOC processor modifies data and writes it back into the L1 Cache, the data in the L1 Cache is marked as dirty. When other CPUs in the SOC request access to that data, their local L1 Caches do not hold the latest copy, because the data has only been updated in the local L1 Cache of the writing CPU, so the data must be synchronized using a coherence algorithm.
Recording the miss operation in the Cache miss queue and notifying the CPU to re-execute the corresponding instruction after the L1 Cache fetches the data back includes the following steps:
Recording the address of the missed instruction into the miss queue and notifying the CPU to continue executing subsequent instructions;
Simultaneously sending the address of the missing data to the bus and notifying the lower-level storage unit to return the corresponding data;
Fetching the required data back into the SRAM of the L1 Cache through the bus, fetching the previously stored missed-instruction address from the miss queue, and notifying the CPU to execute the instruction again.
In the maintenance of data consistency among multiple cores, data carries different permissions on different cores. The coherence protocol defines 4 permission levels: no local copy; a local copy with read permission only; a local copy with read and write permission; and dirty data. The maintenance specifically includes the following steps:
When data requested from a remote end is marked as dirty, the dirty data must first be written back to lower-level memory while the data is returned to the local cache, so that both the remote end and the local end hold the latest copy;
When the remote end's data is marked as dirty, the copy is first written back and simultaneously returned to the requesting end; at this point the remote copy's permission becomes Branch or Trunk, and when the requesting end writes the data, the local data is marked dirty while the remote copy is invalidated;
when data needs to be written, it is judged whether the permission of the local copy is sufficient: if sufficient, the data is written directly; otherwise the data permission is upgraded first, invalidating the other cores' copies and setting the local permission to Dirty, and then the data is written.
There are two ways to write back Dirty data: one is the data write-back during the consistency maintenance described above, and the other is the CPU actively writing back the Dirty data, including:
When consistency maintenance involves dirty data, the data is written back to the next level of memory while consistency maintenance is re-performed;
When too much dirty data has accumulated locally, the CPU can actively write back all dirty data in a single pipelined pass (a sketch of such a flush pass follows below).
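A software view of the active write-back pass (sweeping the whole cache and writing back every Dirty block) might look like the following sketch; all three helpers are hypothetical:

```c
extern int  line_is_dirty(unsigned set, unsigned way);   /* hypothetical helpers */
extern void writeback_line(unsigned set, unsigned way);
extern void mark_clean(unsigned set, unsigned way);

/* Walk every set and way, streaming each Dirty block back to lower-level memory. */
static void flush_all_dirty(unsigned sets, unsigned ways) {
    for (unsigned s = 0; s < sets; s++)
        for (unsigned w = 0; w < ways; w++)
            if (line_is_dirty(s, w)) {
                writeback_line(s, w);  /* data goes to the next level of memory */
                mark_clean(s, w);      /* permission drops from Dirty */
            }
}
```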
A separate cache ICache must be designed to provide instructions for the CPU instruction fetch stage, which includes the following steps:
when the instruction fetch (IF) stage of the CPU pipeline needs the instruction corresponding to the program counter PC, the instruction at the corresponding address is fetched from the ICache and returned to the CPU;
The ICache only needs to provide data access for the CPU instruction fetch stage; the CPU never writes data back to the ICache, all ICache data comes from lower-level memory written in through the bus, and no data becomes dirty through CPU writes, so the ICache needs no consistency maintenance or support for data write-back.
Further, the ICache is designed as an 8-way set-associative dual-bank structure with three storage structures, vb_array, tag_array and data_array, used respectively for storing valid bits, tag data and the instruction data itself. The valid bit of each block needs only 1 bit, so vb_array is sized 64x8 bits; each tag holds 20 bits, so tag_array is sized 64x8x20 bits. The data_array is the storage structure for the data itself; as shown in FIG. 1, its SRAM is split into two banks to reduce access latency. Eight single SRAMs in each bank are combined into one way, one SRAM row is 64 bits, and 4 rows form one block within a bank, so the data_array size is 64x4x8x2x64 bits.
The locations indicated by the dashed lines in FIG. 2 are pipeline registers. The ICache is designed as a 2-stage pipeline divided into 3 stages; stage 0 accepts CPU requests and the bus refill requests. When the req_fire signal is valid, the Cache starts the hit-judgment logic: after splitting the address, it passes the address to tag_array and data_array respectively. When d_fire on the bus D channel is valid, the refill logic is entered; the refill logic has higher priority than the CPU request logic, since otherwise a pending miss request would be lost. The bus is 128 bits wide and each refill is one 64 B cache block, so one block needs 4 cycles to transfer. The corresponding data is refilled into the replacement way given by the replacement algorithm; the refill_cnt counter counts from 0 to 3, and when the transfer completes, the corresponding tag is written into tag_array and the corresponding bit in vb_array is set to 1. The invalidated signal indicates that the entire ICache is disabled; it is used by the ifence instruction, which restricts the ordering of instructions before and after it and flushes the ICache to avoid the effects of earlier instructions.
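The refill path above moves one 64 B block over a 128-bit bus in four beats counted by refill_cnt; a minimal C model of one beat, with hypothetical array-write helpers, is:

```c
#include <stdint.h>

#define BEATS_PER_BLOCK 4   /* 64 B block / 16 B (128-bit) bus beat */

extern void data_array_write(uint32_t idx, int way, int beat, const uint8_t *beat_data);
extern void tag_array_write(uint32_t idx, int way, uint32_t tag);
extern void vb_array_set(uint32_t idx, int way);

/* Called once per valid D-channel beat; the tag and valid bit are written only
   on the last beat, so a partially refilled block can never be hit. */
static void refill_beat(uint32_t idx, int way, uint32_t tag,
                        int refill_cnt, const uint8_t *beat_data) {
    data_array_write(idx, way, refill_cnt, beat_data);
    if (refill_cnt == BEATS_PER_BLOCK - 1) {  /* refill_cnt has counted 0..3 */
        tag_array_write(idx, way, tag);
        vb_array_set(idx, way);               /* block becomes visible to lookups */
    }
}
```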
Stage 1 of the ICache pipeline obtains the tag data and the data read from the SRAM and compares the 20x8 bits of read tags, qualified by the valid bits, with the tag requested by the CPU; if equal, it is a hit and s1_hit is pulled high to enter the hit logic, otherwise the cache miss logic is entered.
Stage 2 of the ICache pipeline returns the corresponding data to the CPU or requests the missing data from the bus. On a miss, the miss logic requests the missing data from lower-level memory through the A channel of the bus, the replacement algorithm generates the required replacement way, and the missing address is placed in a register; after the data returns from the bus, the corresponding data is written into data_array and tag_array using the replacement way and the missing address.
A separate cache DCache is designed to provide data access and caching for the LSU stage of the CPU, including:
When a load instruction executes, the data at the requested address must be provided to the CPU; when a store instruction executes, the corresponding data must be cached at the corresponding address, and the permission of that data in the DCache is set to Dirty;
Since the LSU module accesses the DCache and writes data, data in the DCache can become dirty, so consistency-maintenance and write-back modules must be added to the DCache. To improve CPU performance, when data the LSU wants to access is not in the DCache, the CPU does not enter a wait state on the miss but instead executes other instructions unrelated to the missed one, so a miss handling mechanism must also be added.
Furthermore, the DCache is designed as an 8-way set-associative single-bank structure with three storage structures, used respectively for storing permission data, tag data and the accessed data. The cache defines 4 permission levels, Nothing, Branch, Trunk and Dirty, indicating respectively: no corresponding data in the cache; corresponding data present but read-only; corresponding data present and readable/writable; and corresponding data present and already written by the CPU. Each block therefore needs 2 bits to express its permission, so meta_array is sized 2x8x64 bits. The tag_array design is consistent with the ICache and is sized 20x8x64 bits. The data_array is designed as a single bank; as shown in FIG. 3, it consists of 8 SRAMs, each SRAM row is 128 bits, and 4 rows form one block, so the data_array size is 128x8x4x64 bits, the same total capacity as in the ICache.
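The array sizes quoted above follow directly from the stated organization; a few C compile-time checks make the arithmetic explicit (the constant names are ours, not the design's):

```c
#include <assert.h>

enum {
    WAYS           = 8,
    SETS           = 64,
    META_BITS      = 2,    /* encodes Nothing / Branch / Trunk / Dirty */
    TAG_BITS       = 20,
    ROW_BITS       = 128,  /* one SRAM row in the DCache data_array */
    ROWS_PER_BLOCK = 4     /* 4 x 128 bits = one 64 B block */
};

static_assert(META_BITS * WAYS * SETS == 1024,   "meta_array: 2x8x64 bits");
static_assert(TAG_BITS * WAYS * SETS == 10240,   "tag_array: 20x8x64 bits");
static_assert(ROW_BITS * ROWS_PER_BLOCK * WAYS * SETS == 262144,
              "data_array: 128x4x8x64 bits = 32 KiB, matching the ICache capacity");
```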
The positions indicated by the dashed lines in FIG. 4 are pipeline registers. The DCache is designed as a 5-stage pipeline divided into 6 stages: the first stages handle non-atomic operations, while the last three mainly handle atomic operations. The DCache has 6 data streams: lsu_req, writeback, prober, prefetch, mshr_read and replay. Stage s0 mainly performs data-path selection; an address arbiter then decides which request's address is finally delivered to the meta permission-access module and the data-access module for permission and data reads.
In stage s1, the meta and tag information is obtained; with it, this stage can judge whether the req operation hits in the Cache. s1_tag_eq_way is an 8-bit one-hot signal generated by the hit-logic module, which compares the req's tag with the tags read from meta and sets the corresponding bit of s1_tag_eq_way to 1 when the coh permission is valid. The way-enable signals such as s1_meta_read_way_en are likewise one-hot, and the way_en select module selects one of them as s1_match_way according to the type s1_type. At the same time, the LSU module passes the branch-prediction result to this stage; if the branch was mispredicted, execution of the instruction is cancelled and no subsequent pipeline operations are performed, saving energy.
The data read from data_array is obtained in stage s2. For the lsu_req data path, hit status is judged: on a hit the corresponding write is performed or the read data is returned to the LSU module; on a miss, the miss is recorded in the MSHR unit, which sends the address of the missing data to the bus, waits for the corresponding data to return and be recorded into the Cache, and then notifies the LSU to enter replay. For the replay data path, the corresponding information is returned to the LSU unit in stage s2. For the prefetch data path, the address to be prefetched is sent to the MSHR unit, which requests the data from the bus. For the writeback data path, the data that must be written back is sent to the bus while the permission of the corresponding data is changed. For the prober data path, the corresponding data is written to the bus and its permission is modified. For the mshr_read data path, the data read for the MSHR is sent to the MSHR module.
Stage s2 also judges atomic operations; before an atomic operation is performed, the corresponding lock must be acquired with LR/SC instructions. LR (Load-Reserved) is an exclusive read instruction that reads 32-bit data from memory (the address specified by the rs1 register; LR.W and LR.D are decoded into the same instruction during decoding, W denoting 4 B alignment and D denoting 8 B alignment) and stores it in the rd register. SC (Store-Conditional) is an exclusive write instruction that writes 32-bit data to memory (the address specified by the rs1 register; SC.W and SC.D are decoded into the same instruction, W denoting 4 B alignment and D denoting 8 B alignment), the value written being taken from the rs2 register. An SC instruction does not necessarily succeed; it succeeds only if: 1. the paired LR and SC access the same address; 2. no other operation accesses that address between the LR and the SC; 3. no interrupt or exception occurs between the LR and the SC; and 4. no MRET instruction executes between the LR and the SC. Lock acquisition proceeds as follows: 1. read the value in the lock with an LR instruction; 2. examine the value read: if it is 1, the lock is held by another core, so return to step 1 and read again; if it is 0, the lock is free, so proceed to step 3; 3. write 1 into the lock with an SC to attempt to take it, then check the instruction's return value: success of the exclusive write means locking succeeded, otherwise locking failed. Since the bus is not locked between the first read and the later write, other cores may access the lock concurrently: another core may also see 0 in the lock and write 1 to attempt locking, but the monitoring mechanism in the system guarantees that only the first exclusive write succeeds and later ones fail, ensuring that exactly one core acquires the lock.
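The three-step lock-acquisition loop described above can be sketched in C. Standard C has no LR/SC intrinsics, so a compare-and-swap from <stdatomic.h> stands in for the LR/SC pair here; on RISC-V hardware this would compile down to an lr.w/sc.w sequence:

```c
#include <stdatomic.h>

/* Spin until the lock is acquired, mirroring the LR/SC loop:
   1) read the lock value (LR); 2) retry while it reads 1 (held);
   3) try to write 1 (SC), which succeeds only if nothing intervened. */
static void lock_acquire(atomic_int *lock) {
    int expected;
    do {
        expected = 0;  /* only attempt the write when the read saw the lock free */
        /* a weak CAS may fail spuriously, just as an SC may, so we simply retry */
    } while (!atomic_compare_exchange_weak(lock, &expected, 1));
}

static void lock_release(atomic_int *lock) {
    atomic_store(lock, 0);  /* hand the lock back */
}
```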
The writeback module performs write-back and is controlled by a state machine with 5 states: s_invalid, s_fill_buffer, s_lsu_release, s_active and s_grant. The state transitions are shown in FIG. 5. s_invalid is the initial state; when the fire signal of req is valid, the state machine starts and jumps to s_fill_buffer, while the data_req_cnt counter is initialized to 0 and acked is set to 0. In this state the block pointed to by the req address is read into wb_buffer, a 4x128-bit register, since 128 bits are stored as one row in the SRAM and 4 rows form one block. The read address for the meta permission data is simply req_idx, but note that the read address for the data is {req_idx, data_req_cnt[1:0]}. The valid signal of wb_io_data_req goes high when the ready signal of the corresponding arbiter is pulled high, indicating that the corresponding request has been accepted; when the arbiter's ready signal is 1, data_req_cnt is incremented to read the next row of the data. Note that the wb_buffer assignment needs two beats, because it takes two clock cycles from the read address entering the memory to the data arriving at the output; the io_data_resp data must be written into wb_buffer[r2_data_cnt] once r2_data_req_fire is 1. After the wb_buffer is completely filled, the s_lsu_release state is entered and data_req_cnt is reset to 0. In s_lsu_release the wb_buffer already caches the req's data; the data is converted into a TLBundleC structure by the ProbeAck_data function and delivered to the io_lsu_release output, and once the io_lsu_release data is correctly received, the s_active state is entered. In s_active the data is transmitted to the bus: depending on whether the write-back is voluntary or passive, the Release_data or ProbeAck_data function is selected to convert the data into TLBundleC-type data for the bus. data_req_cnt starts at 0 and is incremented each time the bus successfully receives one piece of data. After the transmission finishes, the s_grant or s_invalid state is entered depending on whether the write-back was voluntary. Note that the mem_grant signal may also be received in this state; after it is received, acked is set to 1, indicating that the operation can finish directly. The s_grant state waits for the io_mem_grant signal, sets acked to 1 when the grant signal is received, and finishes once acked is 1.
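Condensed to its transitions, the five-state writeback machine described above looks roughly like this C sketch; the boolean event inputs are simplifications of the handshake signals named in the text:

```c
typedef enum {
    S_INVALID, S_FILL_BUFFER, S_LSU_RELEASE, S_ACTIVE, S_GRANT
} wb_state;

/* One transition step; each int argument abstracts a req/resp handshake event. */
static wb_state wb_step(wb_state st, int req_fire, int buffer_full,
                        int lsu_release_acked, int send_done,
                        int voluntary, int grant_acked) {
    switch (st) {
    case S_INVALID:      /* idle until a write-back request fires */
        return req_fire ? S_FILL_BUFFER : S_INVALID;
    case S_FILL_BUFFER:  /* read the 4 x 128-bit block into wb_buffer */
        return buffer_full ? S_LSU_RELEASE : S_FILL_BUFFER;
    case S_LSU_RELEASE:  /* hand the buffered block to io_lsu_release */
        return lsu_release_acked ? S_ACTIVE : S_LSU_RELEASE;
    case S_ACTIVE:       /* stream the buffer to the bus beat by beat */
        if (!send_done) return S_ACTIVE;
        return voluntary ? S_GRANT : S_INVALID;  /* voluntary write-back awaits a grant */
    case S_GRANT:        /* wait for io_mem_grant before finishing */
        return grant_acked ? S_INVALID : S_GRANT;
    }
    return S_INVALID;
}
```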
The prober module has 11 states: s_invalid, s_meta_read, s_meta_resp, s_mshr_req, s_mshr_resp, s_lsu_release, s_release, s_writeback_req, s_writeback_resp, s_meta_write and s_meta_write_resp; the state transitions are shown in FIG. 6. s_invalid is the initial state; when the B channel requests a probe, the state jumps to s_meta_read. s_meta_read reads the permission and tag of the requested address: the valid signal of the meta_read output port is pulled high, the request's address information is assigned to the meta_read structure, and the module waits for metaReadArb to process the request; when the request is accepted and the ready signal is pulled high, the state jumps to s_meta_resp. s_meta_resp waits for the meta and tag data to be read and after one beat jumps directly to s_mshr_req. In s_mshr_req, the old_coh register is updated with the meta permission and tag of the requested data, and whether the requested block is in the cache is judged by tag comparison, yielding the valid tag_matches signal; the new permission after the probe is computed by the onProbe function, and the state advances to s_mshr_resp. After the MSHR module accepts the request, the state proceeds through s_lsu_release and s_release, where the probe response is delivered. In s_writeback_req, if the data exists in the cache, the corresponding block must be written back: the write-back request information is submitted to the wbArb arbiter, and once it is accepted by the writeback module, the state jumps to s_writeback_resp to wait for the write-back to complete. In s_writeback_resp, when the writeback module signals completion, the state jumps to s_meta_write, where the new permission computed for the block is written into meta_array; after the write completes, s_meta_write_resp is entered and the state machine finally returns to s_invalid.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations and modifications of the present invention will be apparent to those of ordinary skill in the art in light of the foregoing description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications arising therefrom remain within the scope of the present invention.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111305639.7A CN114020656B (en) | 2021-11-05 | 2021-11-05 | Non-blocking L1 Cache in Multi-core SOC |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114020656A CN114020656A (en) | 2022-02-08 |
CN114020656B true CN114020656B (en) | 2025-01-24 |
Family
ID=80061748
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111305639.7A Active CN114020656B (en) | 2021-11-05 | 2021-11-05 | Non-blocking L1 Cache in Multi-core SOC |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114020656B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114721996B (en) * | 2022-06-09 | 2022-09-16 | 南湖实验室 | Method and device for realizing distributed atomic operation |
CN115016940A (en) * | 2022-06-14 | 2022-09-06 | 广东赛昉科技有限公司 | Method and system for realizing AMO (active matrix/active matrix) instruction in L2 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105955711A (en) * | 2016-04-25 | 2016-09-21 | 浪潮电子信息产业股份有限公司 | Buffering method supporting non-blocking miss processing |
CN111414318A (en) * | 2020-03-24 | 2020-07-14 | 江南大学 | A Data Consistency Implementation Method Based on Advance Update |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063407B (en) * | 2010-12-24 | 2012-12-26 | 清华大学 | Network sacrifice Cache for multi-core processor and data request method based on Cache |
US9274960B2 (en) * | 2012-03-20 | 2016-03-01 | Stefanos Kaxiras | System and method for simplifying cache coherence using multiple write policies |
CN112181712B (en) * | 2020-09-28 | 2022-02-22 | 中国人民解放军国防科技大学 | Method and device for improving reliability of processor core |
-
2021
- 2021-11-05 CN CN202111305639.7A patent/CN114020656B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN114020656A (en) | 2022-02-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |