CN114579505A

CN114579505A - Chip and inter-core communication method

Info

Publication number: CN114579505A
Application number: CN202011387367.5A
Authority: CN
Inventors: Not disclosed
Original assignee: Beijing Simm Computing Technology Co ltd
Current assignee: Beijing Simm Computing Technology Co ltd
Priority date: 2020-12-01
Filing date: 2020-12-01
Publication date: 2022-06-03

Abstract

A chip and an inter-core communication method are disclosed. The chip includes: a plurality of processing core groups, each processing core group comprising at least one processing core, any one of the plurality of processing core groups being connected to the others via a network on chip (NOC); a plurality of local shared memory units (LMSUs) in one-to-one correspondence with the plurality of processing core groups, each LMSU being connected to all processing cores in its corresponding processing core group through a first level node of the NOC; and a broadcast station connected to a second level node of the NOC and to each of the plurality of LMSUs.

Description

Chip and inter-core communication method

Technical Field

The present disclosure relates to the field of chip technology, and more particularly, to a chip and an inter-core communication method for the chip, which can improve the efficiency of inter-core communication, reduce the complexity of a communication network, save a chip area, and reduce costs.

Background

With increasing demands on chip performance, integrating multiple processor cores in a single chip has become a trend. In many-core and multi-core architectures, inter-core communication has always been both important and difficult. Although networks on chip (NOCs) have been proposed as a communication method for systems on chip (SoCs) and perform significantly better than conventional bus-based systems, transmitting large amounts of data between cores typically congests the network on chip, resulting in excessively long transmission delays. Meanwhile, synchronization problems between cores often require additional local dedicated memory, which increases chip area and wastes resources, because data that is shared must nevertheless be replicated in multiple copies.

Disclosure of Invention

Embodiments of the present disclosure provide a chip, an inter-core communication method, an electronic device, and a computer-readable storage medium, which improve inter-core communication efficiency, reduce communication network complexity, save chip area, and reduce cost.

In one general aspect, there is provided a chip, comprising: a plurality of processing core groups, each processing core group comprising at least one processing core, any one of the plurality of processing core groups being connected to the others via a network on chip (NOC); a plurality of local shared memory units (LMSUs) in one-to-one correspondence with the plurality of processing core groups, each LMSU being connected to all processing cores within its corresponding processing core group via a first level node of the NOC; and a broadcast station connected to a second level node of the NOC and to each of the plurality of LMSUs.

By providing an LMSU for each processing core group in the chip together with a broadcast station, inter-core communication can be scheduled appropriately, which effectively avoids data congestion within the chip and reduces the transmission delay of inter-core communication. In addition, providing the LMSUs and the broadcast station reduces the amount of dedicated memory in the chip, saving chip area and reducing cost. Moreover, because the LMSUs correspond one-to-one to the processing core groups, the chip is easy to scale.

Optionally, the LMSU comprises: a receive (RX) directory for recording the data indexes of data to be received by processing cores in the corresponding processing core group that act as receiving cores; a transmit enable chain, through which a processing core acting as the sending core designates which processing cores in the corresponding processing core group are receiving cores in the current transmission; a receive enable chain, through which each processing core in the corresponding processing core group, acting as a receiving core, sets its own working state; a local memory for storing data; and a monitoring queue for monitoring the communication port between the LMSU and the broadcast station.

The RX directory, transmit enable chain, receive enable chain, and monitoring queue in the LMSU can be configured flexibly, so that both single-core transmission with single-core reception and single-core transmission with multi-core reception are supported, which improves inter-core communication efficiency and reduces communication network complexity.

Optionally, the transmit enable chains in a plurality of the LMSUs are concatenated and mapped to the broadcast station.

In another general aspect, there is provided an inter-core communication method for a chip comprising a plurality of processing core groups, each processing core group comprising at least one processing core and being connected to a local shared memory unit (LMSU) via a network on chip (NOC). The inter-core communication method comprises: a processing core serving as a sending core accessing, through the NOC, the LMSU corresponding to another processing core serving as a receiving core; determining, according to parameters in that LMSU, whether the receiving core is ready to receive; and in response to the receiving core being ready to receive, the sending core transmitting data to the receiving core through the NOC, a broadcast station, and the corresponding LMSU, wherein the broadcast station is connected to the NOC and to all of the LMSUs.

This inter-core communication method can effectively avoid data congestion within the chip, reduce the transmission delay of inter-core communication, improve inter-core communication efficiency, and reduce communication network complexity.

Optionally, the inter-core communication method further includes: in response to the receiving core not being ready to receive, the sending core transmitting the data to the corresponding LMSU through the NOC.

In another general aspect, there is provided an inter-core communication method for a chip comprising a plurality of processing core groups, each processing core group comprising at least one processing core and being connected to a local shared memory unit (LMSU) via a network on chip (NOC). The inter-core communication method comprises: a processing core serving as a sending core sending configuration information to the LMSUs through the NOC and a broadcast station, wherein the broadcast station is connected to a second level node of the NOC and to all of the LMSUs; and in response to receiving a configuration complete message sent by the broadcast station, the sending core sending the data to each of the LMSUs through a first level node of the NOC and the broadcast station.

Optionally, a first LMSU determines, according to the configuration information, that multiple processing cores in the processing core group connected to the first LMSU serve as receiving cores; if the first LMSU determines that at least one of these receiving cores is ready to receive, the first LMSU sends the data to any receiving core that is ready to receive and also stores the data; or, if the first LMSU determines that none of the receiving cores is ready to receive, the first LMSU stores the data, wherein the first LMSU is any one of the LMSUs.

In another general aspect, there is provided an inter-core communication method for a chip comprising a plurality of processing core groups, each processing core group comprising at least one processing core and being connected to a local shared memory unit (LMSU) via a network on chip (NOC). The inter-core communication method comprises: in response to executing an instruction to receive data, a first processing core querying the LMSU corresponding to the first processing core; and in response to finding a record in the LMSU indicating that the first processing core is a receiving core, the first processing core reading the data from the LMSU.

Optionally, in response to finding no record in the LMSU indicating that the first processing core is a receiving core, the first processing core configures the LMSU to generate a record indicating that the first processing core is ready to receive and a record indicating that the LMSU is to monitor data sent by the broadcast station.

Optionally, after the first processing core configures the LMSU to generate the record indicating that the first processing core is ready to receive, the inter-core communication method further includes: in response to the LMSU detecting data sent by the broadcast station, the first processing core acquiring that data from the LMSU.

In another general aspect, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements an inter-core communication method as described above.

In another general aspect, there is provided an electronic device, including: a processor implemented by a chip as described above; a memory storing a computer program to be executed by the processor.

In another general aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements an inter-core communication method as described above.

The chip, inter-core communication method, electronic device, and computer readable storage medium according to embodiments of the present disclosure address the problems of conventional multi-core architectures, namely excessively long transmission delay, large memory area, and susceptibility to deadlock during transmission, and offer short transmission delay, flexible configuration, reduced memory area, and easy scalability.

Drawings

The above and other objects and features of the embodiments of the present disclosure will become more apparent from the following description taken in conjunction with the accompanying drawings which illustrate the embodiments by way of example, in which:

FIG. 1 is a schematic diagram illustrating a chip according to an embodiment of the present disclosure;

FIG. 2 is a flow diagram illustrating an inter-core communication method according to one embodiment of the present disclosure;

FIG. 3 is a flow diagram illustrating an inter-core communication method according to another embodiment of the present disclosure;

FIG. 4 is a flow diagram illustrating an inter-core communication method according to yet another embodiment of the present disclosure;

FIG. 5 is a block diagram illustrating an electronic device according to an embodiment of the present disclosure.

Detailed Description

The following detailed description is provided to assist the reader in obtaining a thorough understanding of the methods, devices, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatus, and/or systems described herein will be apparent to those skilled in the art after reviewing the disclosure of the present application. For example, the order of operations described herein is merely an example, and is not limited to those set forth herein, but may be changed as will become apparent after understanding the disclosure of the present application, except to the extent that operations must occur in a particular order. Moreover, descriptions of features known in the art may be omitted for clarity and conciseness.

The features described herein may be embodied in different forms and should not be construed as limited to the examples described herein. Rather, the examples described herein have been provided to illustrate only some of the many possible ways to implement the methods, apparatus and/or systems described herein, which will be apparent after understanding the disclosure of the present application.

As used herein, the term "and/or" includes any one of the associated listed items and any combination of any two or more.

Although terms such as "first", "second", and "third" may be used herein to describe various elements, components, regions, layers or sections, these elements, components, regions, layers or sections should not be limited by these terms. Rather, these terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first member, first component, first region, first layer, or first portion referred to in the examples described herein can also be referred to as a second member, second component, second region, second layer, or second portion without departing from the teachings of the examples.

In the specification, when an element (such as a layer, region or substrate) is described as being "on," "connected to" or "coupled to" another element, it can be directly on, connected to or coupled to the other element or one or more other elements may be present therebetween. In contrast, when an element is referred to as being "directly on," "directly connected to," or "directly coupled to" another element, there may be no intervening elements present.

The terminology used herein is for the purpose of describing various examples only and is not intended to be limiting of the disclosure. The singular is also intended to include the plural unless the context clearly indicates otherwise. The terms "comprises," "comprising," and "having" specify the presence of stated features, quantities, operations, elements, components, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, quantities, operations, components, elements, and/or combinations thereof.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs after understanding the present disclosure. Unless explicitly defined as such herein, terms (such as those defined in general dictionaries) should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and should not be interpreted in an idealized or overly formal sense.

Further, in the description of the examples, detailed descriptions of well-known related structures or functions are omitted when they would obscure the present disclosure.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. Embodiments, however, may be embodied in various forms and are not limited to the examples described herein.

Fig. 1 is a schematic diagram illustrating a chip according to an embodiment of the present disclosure.

Referring to FIG. 1, a chip according to an embodiment of the present disclosure may include a plurality of processing core groups 11, 12, 13, and 14, a plurality of local shared memory units (LMSUs) 21, 22, 23, and 24, and a broadcast station 30. Furthermore, the chip may also include a network on chip (NOC), which includes first level nodes NOC00, NOC01, NOC02, and NOC03 and a second level node NOC20. Although the numbers of processing core groups, LMSUs, and first level NOC nodes are each illustrated as 4 in the drawings, embodiments of the present disclosure are not limited thereto; the number of processing core groups, LMSUs, and first level NOC nodes on a chip may be set as appropriate according to design specifications.

Specifically, each processing core group includes at least one processing core, and any one of the processing core groups 11, 12, 13, and 14 is connected to the other processing core groups through the NOC. For example, processing core group 11 may be connected to processing core groups 12, 13, and 14 through the NOC; processing core group 12 may be connected to processing core groups 11, 13, and 14 through the NOC; processing core group 13 may be connected to processing core groups 11, 12, and 14 through the NOC; and processing core group 14 may be connected to processing core groups 11, 12, and 13 through the NOC. More specifically, processing core group 11 may be connected to processing core group 12 through NOC00, NOC20, and NOC01, to processing core group 13 through NOC00, NOC20, and NOC02, and to processing core group 14 through NOC00, NOC20, and NOC03.
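The two-level routing in these examples can be sketched in a few lines, assuming the group-to-node mapping of FIG. 1 (group 11 on NOC00 through group 14 on NOC03) and that every inter-group path climbs to NOC20 and back down. The name `noc_path` is hypothetical, and the single-node path for traffic within one group is an assumption that the text does not state.

```python
# Hypothetical model of the two-level NOC in FIG. 1.
FIRST_LEVEL_NODE = {11: "NOC00", 12: "NOC01", 13: "NOC02", 14: "NOC03"}
SECOND_LEVEL_NODE = "NOC20"


def noc_path(src_group: int, dst_group: int) -> list:
    """Return the NOC nodes traversed from one processing core group to another."""
    if src_group == dst_group:
        # Assumption: traffic within a group stays on its first level node.
        return [FIRST_LEVEL_NODE[src_group]]
    # Inter-group traffic goes up to the second level node and back down.
    return [FIRST_LEVEL_NODE[src_group], SECOND_LEVEL_NODE, FIRST_LEVEL_NODE[dst_group]]


print(noc_path(11, 13))  # ['NOC00', 'NOC20', 'NOC02']
```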

The plurality of LMSUs 21, 22, 23, and 24 are in one-to-one correspondence with the processing core groups 11, 12, 13, and 14, and each LMSU may be connected to all processing cores within its corresponding processing core group through a first level node of the NOC. For example, LMSU 21 corresponds to processing core group 11, LMSU 22 to processing core group 12, LMSU 23 to processing core group 13, and LMSU 24 to processing core group 14. Further, LMSU 21 may be connected to processing core group 11 through first level node NOC00, LMSU 22 to processing core group 12 through first level node NOC01, LMSU 23 to processing core group 13 through first level node NOC02, and LMSU 24 to processing core group 14 through first level node NOC03.

Further, the broadcast station may be connected to a second level node of the NOC and to each of the LMSUs 21, 22, 23, and 24. For example, the broadcast station may be connected to the second level node NOC20 and connected directly to each of LMSUs 21, 22, 23, and 24 without going through the NOC. According to embodiments of the present disclosure, the broadcast station may be implemented as a logic unit.

Providing the LMSUs and the broadcast station in the chip reduces the amount of dedicated memory in the chip, saving chip area and reducing cost. Meanwhile, because the LMSUs correspond one-to-one to the processing core groups, the chip is easy to scale.

According to an embodiment of the present disclosure, each LMSU may include a receive (RX) directory, a transmit enable chain (TX_EN_CHAIN), a receive enable chain (RX_EN_CHAIN), a local memory (Memory), and a monitoring queue (snoop queue).

The RX directory is used to record data indexes of the processing cores in the corresponding processing core group as receiving cores. The data index of the processing core may include a data address of data to be received by a processing core serving as a receiving core in the corresponding processing core group; optionally, the data index may further include the number of data blocks to be received and the size of each data block to be received.
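A software model of one LMSU helps keep its five components apart. The sketch below is a minimal illustration under stated assumptions, not the hardware design: the class and field names are hypothetical, the enable chains are modeled as integers in which bit k-1 stands for the k-th processing core of the group, and the local memory is a dictionary keyed by address.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataIndex:
    """Data index: an address plus the optional block count and block size."""
    address: int
    num_blocks: Optional[int] = None
    block_size: Optional[int] = None


@dataclass
class LMSU:
    """Hypothetical software model of a local shared memory unit."""
    cores_per_group: int = 4
    # RX directory: receiving core number (1-based) -> indexes of data held for it.
    rx_directory: Dict[int, List[DataIndex]] = field(default_factory=dict)
    tx_en_chain: int = 0  # bit k-1 set: core k is a receiving core of this transmission
    rx_en_chain: int = 0  # bit k-1 set: core k is ready to receive
    memory: Dict[int, bytes] = field(default_factory=dict)  # local memory, by address
    snoop_queue: List[DataIndex] = field(default_factory=list)  # awaited data indexes

    def save_for(self, core: int, index: DataIndex, data: bytes) -> None:
        """Store data in local memory and record its index in the RX directory."""
        self.memory[index.address] = data
        self.rx_directory.setdefault(core, []).append(index)

    def lookup(self, core: int, address: int) -> Optional[DataIndex]:
        """Return the RX directory entry for `core` at `address`, if any."""
        for index in self.rx_directory.get(core, []):
            if index.address == address:
                return index
        return None
```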

The transmit enable chain is configured by the processing core acting as the sending core to designate which processing cores in the corresponding processing core group are receiving cores in the current transmission; the receive enable chain is configured by a processing core in the corresponding processing core group, acting as a receiving core, to set its own working state. The number of bits in the transmit enable chain and the receive enable chain may be determined by the number of processing cores in the processing core group corresponding to the LMSU. For example, when a processing core group includes 4 processing cores, the transmit enable chain and the receive enable chain in the LMSU may each be 4 bits long. Optionally, the transmit enable chains of the LMSUs 21, 22, 23, and 24 may be concatenated and mapped to the broadcast station. In this way, the sending core can designate all receiving cores of a transmission in a single configuration. For example, if the receiving cores of a transmission are located in the processing core groups of three LMSUs, the transmit enable chain to be configured spans the transmit enable chains of those three LMSUs, for a total of 3 × 4 = 12 bits.
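The bit arithmetic of the concatenated chain can be made concrete with a short sketch. It assumes that each LMSU contributes a slice of `cores_per_group` bits, that the slices are placed in a fixed order (position 0 in the least significant bits), and that the k-th core of a group uses bit k-1 of its slice. The text does not specify the bit order, and the function names are hypothetical.

```python
def chain_length(num_lmsus: int, cores_per_group: int = 4) -> int:
    """Total width of the concatenated transmit enable chain."""
    return num_lmsus * cores_per_group


def chain_bit(lmsu_position: int, core: int, cores_per_group: int = 4) -> int:
    """Bit position of core `core` (1-based) of the LMSU at `lmsu_position` (0-based)."""
    if not 1 <= core <= cores_per_group:
        raise ValueError("core number out of range for this group")
    return lmsu_position * cores_per_group + (core - 1)


print(chain_length(3))  # 12, the three-LMSU example above
print(chain_bit(2, 4))  # 11, the last bit of a 12-bit chain
```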

The local memory is used for storing data. For example, the local memory may be volatile memory (such as dynamic RAM (DRAM) or static RAM (SRAM)) or non-volatile memory (such as flash memory, erasable programmable read-only memory (EPROM), and/or electrically erasable programmable read-only memory (EEPROM)).

The monitoring queue is used for monitoring the communication port between the LMSU and the broadcast station. For example, the monitoring queue may monitor the state of the communication port between the LMSU and the broadcast station and determine, according to its configuration, whether data sent to it is data to be received by the LMSU in which it is located (that is, data to be received by a receiving core in the processing core group corresponding to that LMSU).

The RX directory is configured as follows: when data sent to the LMSU is saved to the local memory (for example, because the receiving core is not yet ready to receive), the LMSU writes into the RX directory the data index of that data, which the receiving core will later read from the local memory. The data index includes, for example, the address of the data to be received and, optionally, the number of data blocks to be received and/or the size of each data block.

The transmit enable chain is configured, for example, as follows. Assume that each processing core group includes 4 processing cores. When the first processing core in processing core group 11 is ready to send data to the second processing core of processing core group 12, the third processing core of processing core group 13, and the first processing core of processing core group 14, it configures a transmit enable chain in which the bits corresponding to these receiving cores are set to a preset value: the second bit of the transmit enable chain of LMSU 22 (connected to processing core group 12), the third bit of the transmit enable chain of LMSU 23 (connected to processing core group 13), and the first bit of the transmit enable chain of LMSU 24 (connected to processing core group 14). The first processing core in processing core group 11 then sends the configured transmit enable chain to the broadcast station, which broadcasts it to each LMSU; each LMSU sets the corresponding bits of its internal transmit enable chain to the preset value accordingly, which completes the configuration of the receiving cores for this transmission.
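The worked example above can be traced in a few lines. This sketch assumes the four LMSUs are concatenated in the order 21, 22, 23, 24 with 4 bits each, that the k-th core uses bit k-1 of its LMSU's slice, and that the preset value is 1; `build_tx_chain` and `local_slice` are hypothetical names. The sending core builds one concatenated chain, and each LMSU recovers its own 4-bit transmit enable chain from the broadcast.

```python
CORES_PER_GROUP = 4
LMSU_ORDER = [21, 22, 23, 24]  # assumed concatenation order, least significant first


def build_tx_chain(receivers):
    """Set the bit of each (lmsu_id, core) pair; core numbers are 1-based."""
    chain = 0
    for lmsu_id, core in receivers:
        position = LMSU_ORDER.index(lmsu_id)
        chain |= 1 << (position * CORES_PER_GROUP + core - 1)
    return chain


def local_slice(chain, lmsu_id):
    """Extract the transmit enable chain that one LMSU keeps internally."""
    position = LMSU_ORDER.index(lmsu_id)
    return (chain >> (position * CORES_PER_GROUP)) & ((1 << CORES_PER_GROUP) - 1)


# Core 1 of group 11 sends to core 2 of group 12, core 3 of group 13, core 1 of group 14.
chain = build_tx_chain([(22, 2), (23, 3), (24, 1)])
local = {lmsu_id: local_slice(chain, lmsu_id) for lmsu_id in LMSU_ORDER}
print(local)  # {21: 0, 22: 2, 23: 4, 24: 1}, i.e. 0b0000, 0b0010, 0b0100, 0b0001
```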

The receive enable chain is configured, for example, as follows. When the second processing core of processing core group 12, the third processing core of processing core group 13, and the first processing core of processing core group 14 are ready to receive, they set, respectively, the second bit of the receive enable chain of LMSU 22, the third bit of the receive enable chain of LMSU 23, and the first bit of the receive enable chain of LMSU 24 to a preset value, indicating that the corresponding processing cores are ready to receive.
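Setting and testing a ready bit is simple bit manipulation. The sketch keeps the assumptions used for the transmit enable chain (bit k-1 for the k-th core, preset value 1); `mark_ready` and `is_ready` are hypothetical helpers, and `rx_en` is a stand-in for the receive enable chains of LMSUs 22, 23, and 24.

```python
def mark_ready(rx_en_chain: int, core: int) -> int:
    """Return the chain with the bit of `core` (1-based) set to the preset value 1."""
    return rx_en_chain | (1 << (core - 1))


def is_ready(rx_en_chain: int, core: int) -> bool:
    """Check whether `core` has marked itself ready to receive."""
    return ((rx_en_chain >> (core - 1)) & 1) == 1


# Receive enable chains of LMSUs 22, 23 and 24, initially all "not ready".
rx_en = {22: 0, 23: 0, 24: 0}
for lmsu_id, core in [(22, 2), (23, 3), (24, 1)]:
    rx_en[lmsu_id] = mark_ready(rx_en[lmsu_id], core)

print(rx_en)  # {22: 2, 23: 4, 24: 1}
```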

The monitoring queue is configured, for example, when the LMSU receives configuration information broadcast by the broadcast station, or when a processing core is ready to receive but does not find the data it needs. In either case, the LMSU writes the data index of the data to be received into the monitoring queue, so that the monitoring queue can watch data transmitted from the broadcast station; if incoming data matches a data index in the monitoring queue, the LMSU receives it. The data index of the data to be received includes the address of the data and, optionally, the number of data blocks and/or the size of each data block.
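A minimal model of the monitoring queue follows. It assumes that the address must always match and that the optional block count and block size are compared only when the awaited index specifies them. Removing an entry once it has matched is also an assumption, since the text does not say when entries are retired. `DataIndex` and `SnoopQueue` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataIndex:
    address: int
    num_blocks: Optional[int] = None
    block_size: Optional[int] = None


def matches(awaited: DataIndex, incoming: DataIndex) -> bool:
    """The address must match; optional fields are checked only if the awaited index sets them."""
    if awaited.address != incoming.address:
        return False
    if awaited.num_blocks is not None and awaited.num_blocks != incoming.num_blocks:
        return False
    if awaited.block_size is not None and awaited.block_size != incoming.block_size:
        return False
    return True


class SnoopQueue:
    """Hypothetical monitoring queue holding the indexes of data this LMSU awaits."""

    def __init__(self) -> None:
        self.entries: List[DataIndex] = []

    def expect(self, index: DataIndex) -> None:
        self.entries.append(index)

    def accept(self, incoming: DataIndex) -> bool:
        """True if broadcast data is awaited here; the matched entry is retired (assumed)."""
        for i, awaited in enumerate(self.entries):
            if matches(awaited, incoming):
                del self.entries[i]
                return True
        return False


queue = SnoopQueue()
queue.expect(DataIndex(0x1000, num_blocks=2, block_size=64))
print(queue.accept(DataIndex(0x2000)))         # False: different address
print(queue.accept(DataIndex(0x1000, 2, 64)))  # True: awaited data
```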

By providing one LMSU for each processing core group, each containing an RX directory, a transmit enable chain, a receive enable chain, and a monitoring queue, inter-core communication can be scheduled appropriately, which helps avoid data congestion within the chip and reduces the transmission delay of inter-core communication.

How inter-core communication is implemented on the chip is described in detail below with reference to FIG. 1. According to embodiments of the present disclosure, inter-core communication can be divided into two cases: single-core transmission with single-core reception, and single-core transmission with multi-core reception.

First, the case of single-core transmission with single-core reception is described. Assume that the first processing core in processing core group 11 is the sending core and the second processing core in processing core group 12 is the receiving core.

When the sending core is ready to send data to the receiving core, it may access LMSU 22 through the NOC and determine, based on parameters in LMSU 22, whether the receiving core is ready to receive. If the receiving core is ready, the sending core may send the data to the receiving core through the NOC, the broadcast station, and LMSU 22. If the receiving core is not ready, the sending core may send the data through the NOC to LMSU 22 for storage.
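The sending core's decision reduces to one bit test on the receive enable chain of the receiving core's LMSU. In this sketch `Route` and `choose_route` are hypothetical names, the route strings only restate the two paths in the text, and bit k-1 standing for the k-th core is an assumed convention.

```python
from enum import Enum


class Route(Enum):
    """The two paths described for single-core transmission."""
    VIA_BROADCAST = "NOC -> broadcast station -> LMSU -> receiving core"
    STORE_IN_LMSU = "NOC -> LMSU local memory"


def choose_route(rx_en_chain: int, receiving_core: int) -> Route:
    """Pick the path from the receiving LMSU's receive enable chain."""
    ready = ((rx_en_chain >> (receiving_core - 1)) & 1) == 1
    return Route.VIA_BROADCAST if ready else Route.STORE_IN_LMSU


# Second core of group 12: ready when bit 1 of LMSU 22's chain is set.
print(choose_route(0b0010, 2).value)  # NOC -> broadcast station -> LMSU -> receiving core
print(choose_route(0b0000, 2).value)  # NOC -> LMSU local memory
```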

Specifically, when the sending core is ready to send data to the receiving core (for example, when it executes an instruction to send data to the receiving core), it determines that this transmission has a single receiving core, and then accesses the receive enable chain of LMSU 22 through the NOC (for example, through NOC00, NOC20, and NOC01) to determine whether the receiving core is ready to receive. The receiving core becomes ready as follows: when it reaches an instruction that requires data, it sets the second bit of the receive enable chain of LMSU 22 to 1 (for example, 1 indicates ready and 0 indicates not ready). At the same time, the receiving core configures the monitoring queue in LMSU 22 to hold the data index of the data required by that instruction.

If, on accessing the receive enable chain of LMSU 22 through the NOC, the sending core determines that the receiving core is ready to receive, it sends the data to the broadcast station through the NOC (for example, through NOC00 and NOC20). The broadcast station sends the data directly to LMSU 22, without passing through the NOC. LMSU 22 determines whether the data matches a data index in its monitoring queue; if so, it forwards the data to the receiving core (for example, through NOC01) without storing it in local memory. If not, the data is discarded or ignored.

If instead the sending core determines that the receiving core is not ready to receive, that is, the receiving core has not yet executed the instruction to acquire the data, the sending core sends the data to LMSU 22 through the NOC (for example, through NOC00 and NOC20), and LMSU 22 saves it in its local memory. LMSU 22 then updates its RX directory by writing into it information such as the address of the data in local memory. When the receiving core reaches the instruction that requires the data, it first checks the RX directory of LMSU 22 to determine whether an entry matches the data required by that instruction. If there is a match, the receiving core reads the data directly from LMSU 22; if not, the receiving core configures the receive enable chain of LMSU 22 to indicate that it is ready to receive, and configures the monitoring queue of LMSU 22 to check whether data transmitted by the broadcast station matches the data index of the data to be received.
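The receiving core's side of the single-core case (check the RX directory first, otherwise mark itself ready and arm the monitoring queue) can be sketched as below. The `LMSU` class, its fields, and `on_receive_instruction` are hypothetical; data is identified by address only, and freeing memory after a read is left out because the text does not describe it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class LMSU:
    rx_directory: Dict[int, int] = field(default_factory=dict)  # address -> receiving core
    memory: Dict[int, bytes] = field(default_factory=dict)      # local memory
    rx_en_chain: int = 0                                         # ready bits, bit k-1 = core k
    snoop_queue: Set[int] = field(default_factory=set)          # awaited addresses

    def store(self, core: int, address: int, data: bytes) -> None:
        """Sender path when the receiver is not ready: save data, update RX directory."""
        self.memory[address] = data
        self.rx_directory[address] = core


def on_receive_instruction(lmsu: LMSU, core: int, address: int) -> Optional[bytes]:
    """Run when `core` reaches an instruction that needs the data at `address`."""
    if lmsu.rx_directory.get(address) == core:
        return lmsu.memory[address]       # data already waiting: read it directly
    lmsu.rx_en_chain |= 1 << (core - 1)   # mark this core ready to receive
    lmsu.snoop_queue.add(address)         # watch broadcast traffic for this data
    return None


lmsu22 = LMSU()
lmsu22.store(2, 0x40, b"early")                   # the sender arrived first
print(on_receive_instruction(lmsu22, 2, 0x40))    # b'early'
print(on_receive_instruction(lmsu22, 2, 0x80))    # None; core 2 now waits for 0x80
```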

The following describes the case of single-core transmission and multi-core reception.

Assume that the first processing core in the processing core group 11 is a sending core, and the first processing core, the second processing core, and the third processing core in the processing core group 12, the third processing core in the processing core group 13, and the second processing core, the third processing core, and the fourth processing core in the processing core group 14 are receiving cores.

When the sending core is to send data to multiple receiving cores, it first sends configuration information to the LMSUs through the NOC and the broadcast station. Upon receiving a configuration complete message from the broadcast station, the sending core sends the data through the NOC and the broadcast station to the LMSUs corresponding to the data.

The configuration information includes the transmit enable chain and information about the data to be sent. The transmit enable chain designates, in each LMSU, the processing cores that serve as receiving cores in the current transmission; the information about the data to be sent is used to configure the monitoring queue in each LMSU, so that the LMSU can determine whether data sent to it by the broadcast station is data it is to receive, that is, whether the data corresponds to that LMSU.

Each LMSU determines, from the configuration information of the sending core, which processing cores in its connected processing core group are receiving cores. If there are multiple receiving cores and at least one of them is ready to receive, the LMSU may send the data from the sending core to any one of the ready receiving cores and also store the data in its local memory.

If several processing cores are receiving cores and none of them is ready to receive, the LMSU may store the data from the sending core in its local memory.
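The per-LMSU decision for multiple receiving cores reduces to a small rule. In this hypothetical Python sketch, `dispatch` is an illustrative name, and "any ready receiving core" is resolved by picking the first one. The sketch stores the data only when no receiver is ready or when more than one receiver exists. That storage rule is our reading of how the single-receiver case differs; it is not prescribed by the source.

```python
from typing import Optional, Sequence, Tuple


def dispatch(receivers: Sequence[int], ready: Sequence[int]) -> Tuple[Optional[int], bool]:
    """Return (core to forward the data to, whether the LMSU stores the data).

    receivers: cores in this group marked as receiving cores by the send enable chain.
    ready: cores whose bit in the receive enable chain is set.
    """
    if not receivers:
        return None, False  # the data does not concern this LMSU
    ready_receivers = [c for c in receivers if c in ready]
    # "Any" ready receiving core: this sketch simply picks the first one.
    target = ready_receivers[0] if ready_receivers else None
    # Store if nobody is ready, or if other receiving cores may still need the data.
    store = target is None or len(receivers) > 1
    return target, store
```

For example, `dispatch([0, 1, 2], [1, 2])` forwards to core 1 and stores a copy, while `dispatch([0, 1, 2], [])` only stores.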

Specifically, when the sending core intends to send data to multiple receiving cores, it first configures the send enable chain. The send enable chains of all the LMSUs are connected in series. The sending core can therefore set the bit of every receiving core in the current transmission to a preset value, for example 1, which marks the processing core corresponding to that bit as a receiving core. The sending core sends the configured send enable chain and the information about the data to be sent to the broadcast station through the NOC, and the broadcast station broadcasts them to each LMSU. Each LMSU uses the received send enable chain to configure its processing cores that act as receiving cores in the current transmission. It uses the data information to configure its monitoring queue, so that it can check whether data broadcast to it by the broadcast station is data it is to receive. After completing the configuration, each LMSU sends a configuration complete message to the broadcast station. Upon receiving the configuration complete message from the broadcast station, the sending core sends the data to the broadcast station via the NOC, and the broadcast station broadcasts the data to the individual LMSUs.
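The ordering constraint in this handshake can be sketched as follows: configuration goes first, and data is sent only after the configuration complete message. The classes `Lmsu` and `BroadcastStation`, the function `send_to_many`, the string acknowledgements, and the event log are all hypothetical modelling choices. The monitoring-queue check is reduced to an equality test on the data information.

```python
class Lmsu:
    """Hypothetical LMSU model that accepts only the data it was configured for."""

    def __init__(self, name):
        self.name = name
        self.enable_bits = 0  # this LMSU's portion of the send enable chain
        self.expected = None  # data information written into the monitoring queue
        self.received = []

    def configure(self, enable_bits, data_info):
        self.enable_bits = enable_bits
        self.expected = data_info
        return "config-complete"  # acknowledgement for the broadcast station

    def deliver(self, data_info, payload):
        # Simplified monitoring-queue check: keep data only if it matches the configuration.
        if self.enable_bits and data_info == self.expected:
            self.received.append(payload)


class BroadcastStation:
    def __init__(self, lmsus):
        self.lmsus = lmsus
        self.log = []

    def broadcast_config(self, chain_slices, data_info):
        self.log.append("config")
        # Configuration goes only to LMSUs with a nonzero portion of the chain.
        acks = [self.lmsus[name].configure(bits, data_info)
                for name, bits in chain_slices.items() if bits]
        done = bool(acks) and all(a == "config-complete" for a in acks)
        if done:
            self.log.append("config-complete")  # forwarded to the sending core
        return done

    def broadcast_data(self, data_info, payload):
        self.log.append("data")
        for lmsu in self.lmsus.values():
            lmsu.deliver(data_info, payload)


def send_to_many(station, chain_slices, data_info, payload):
    """Sending-core side: release the data only after the configuration complete message."""
    if not station.broadcast_config(chain_slices, data_info):
        return False
    station.broadcast_data(data_info, payload)
    return True
```

Because `send_to_many` returns before `broadcast_data` whenever configuration fails, no LMSU can see data it has not been configured to monitor.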

Optionally, the broadcast station does not broadcast the send enable chain and the data information to every LMSU. Instead, it sends them only to the LMSUs that have a receiving core indicated in the send enable chain, which reduces power consumption.
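Selecting which LMSUs to configure only requires checking each LMSU's portion of the concatenated send enable chain. The sketch below assumes, hypothetically, four bits per LMSU, with the bit for core c of LMSU g at position g*4 + c (0-based, least significant bit first). The source does not specify the bit order, and `target_lmsus` is an illustrative name.

```python
CORES_PER_GROUP = 4  # assumption: four bits (cores) per LMSU in the concatenated chain


def target_lmsus(chain: int, num_lmsus: int) -> list:
    """Return indices of LMSUs whose portion of the send enable chain has a bit set.

    Only these LMSUs need the configuration information; the rest can be skipped.
    """
    mask = (1 << CORES_PER_GROUP) - 1
    return [g for g in range(num_lmsus) if (chain >> (g * CORES_PER_GROUP)) & mask]


# 0xE470 marks receivers in LMSUs 1, 2 and 3 (0-based) and none in LMSU 0.
print(target_lmsus(0xE470, 4))  # [1, 2, 3]
```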

It should be noted that "broadcasting" and "sending" by the broadcast station are used interchangeably herein. Both mean that the broadcast station transmits data to the LMSUs.

For example, when the sending core intends to send data, it first configures the send enable chain by setting the following bits to the preset value:

- the first, second, and third bits of the portion belonging to LMSU 22 (i.e., the first, second, and third processing cores in processing core group 12);
- the third bit of the portion belonging to LMSU 23 (i.e., the third processing core in processing core group 13);
- the second, third, and fourth bits of the portion belonging to LMSU 24 (i.e., the second, third, and fourth processing cores in processing core group 14).

These processing cores are thereby configured as the receiving cores of this data. The sending core also sends information about the data to be sent to the broadcast station. This information includes the data address of the data to be sent and may also include the number of data blocks and the size of each data block. The broadcast station broadcasts this information to each LMSU, and each LMSU writes it into its monitoring queue. The LMSU then uses the monitoring queue to check whether data sent from the broadcast station is data it is to receive.
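The bit settings in this example can be reproduced with an integer bitmask. This sketch rests on assumptions not stated in the source:

- each group has four cores;
- processing core groups 11 to 14 map to indices 0 to 3;
- the "first bit" of each LMSU's portion is its least significant bit.

The helper names are hypothetical.

```python
CORES_PER_GROUP = 4  # assumption: four processing cores per group
NUM_GROUPS = 4       # assumption: groups 11-14 (LMSUs 21-24) -> indices 0-3


def build_send_enable_chain(receivers):
    """Set bit (group * CORES_PER_GROUP + core) to the preset value 1 for each receiver.

    receivers: iterable of (group_index, core_index) pairs, both 0-based.
    """
    chain = 0
    for group, core in receivers:
        chain |= 1 << (group * CORES_PER_GROUP + core)
    return chain


def slice_for_lmsu(chain, group):
    """Return the CORES_PER_GROUP bits of the concatenated chain owned by one LMSU."""
    mask = (1 << CORES_PER_GROUP) - 1
    return (chain >> (group * CORES_PER_GROUP)) & mask


# Receivers from the example: cores 1-3 of group 12, core 3 of group 13, cores 2-4 of group 14.
chain = build_send_enable_chain([(1, 0), (1, 1), (1, 2), (2, 2), (3, 1), (3, 2), (3, 3)])
print(hex(chain))                                            # 0xe470
print([slice_for_lmsu(chain, g) for g in range(NUM_GROUPS)])  # [0, 7, 4, 14]
```

Each LMSU only needs its own slice: 0b0111 for LMSU 22, 0b0100 for LMSU 23, and 0b1110 for LMSU 24.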

The sending core sends the configured send enable chain and the information about the data to be sent to the broadcast station, which broadcasts this configuration information to each LMSU. LMSU 22, LMSU 23, and LMSU 24 configure their receiving cores and monitoring queues according to the configuration information. They then send a configuration complete message to the broadcast station, which may forward it to the sending core through the NOC. Upon receiving the configuration complete message, the sending core sends the data through NOC00 and NOC20 to the broadcast station, which sends the data to LMSU 22, LMSU 23, and LMSU 24.

When LMSU 22 receives data sent by the sending core from the broadcast station, its monitoring queue compares the data with the data index of the data to be received. If they match, LMSU 22 accepts the data. LMSU 22 then uses the receive enable chain to determine whether the first, second, and third processing cores, which are configured as receiving cores, are ready to receive. A bit set to a preset value such as 1 indicates that the corresponding processing core is ready to receive; a value such as 0 indicates that it is not.

For example, if the first and second bits of the receive enable chain are set to 1, LMSU 22 may send the received data to any receiving core that is ready, for example the first processing core. It also stores the data in its local memory and updates the RX directory with information about the stored data, such as its storage address. The second processing core can then obtain the data directly from the local memory of LMSU 22 according to the RX directory. Likewise, once the third processing core is ready to receive, it can also obtain the data from the local memory of LMSU 22.
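The monitoring-queue comparison can be sketched as matching incoming data information against registered data indexes. In this hypothetical sketch, `DataInfo` and `MonitoringQueue` are illustrative names. Block count and block size are compared only when the registered entry specifies them, since the source lists them as optional. Consuming a matched entry is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataInfo:
    """Data index of one transfer (field names are hypothetical)."""
    address: int
    num_blocks: Optional[int] = None  # optional in the description
    block_size: Optional[int] = None  # optional in the description


class MonitoringQueue:
    def __init__(self):
        self._entries = []

    def register(self, info: DataInfo) -> None:
        self._entries.append(info)

    def match(self, incoming: DataInfo) -> bool:
        """Accept incoming data if an entry has the same address and, where the entry
        specifies them, the same block count and block size. A match consumes the entry."""
        for i, entry in enumerate(self._entries):
            if entry.address != incoming.address:
                continue
            if entry.num_blocks is not None and entry.num_blocks != incoming.num_blocks:
                continue
            if entry.block_size is not None and entry.block_size != incoming.block_size:
                continue
            del self._entries[i]
            return True
        return False
```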

If only one of the first, second, and third processing cores in processing core group 12 is ready to receive, LMSU 22 may send the received data to that core. It also stores the data in its local memory and updates the RX directory with information about the stored data. Once the other two receiving cores are ready, they can obtain the data directly from the local memory of LMSU 22 according to the RX directory.

If none of the first, second, and third processing cores in processing core group 12 is ready to receive, LMSU 22 may store the received data in its local memory and update the RX directory with information about the stored data. Whenever any of these processing cores becomes ready to receive, it can retrieve the data sent by the sending core from the local memory of LMSU 22 according to the RX directory.
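The three readiness cases for LMSU 22 can be captured in one small model. In this hypothetical sketch, `GroupLmsu` is an illustrative name, and the lowest-numbered ready core is chosen as "any" ready receiving core. The RX directory additionally tracks which receiving cores have not yet read the data. That bookkeeping is an assumption, made so that each late receiver reads the stored copy once.

```python
class GroupLmsu:
    """Hypothetical model of LMSU 22 serving several receiving cores of group 12."""

    def __init__(self, receivers):
        self.receivers = set(receivers)  # cores marked by the send enable chain
        self.ready = set()               # cores whose receive enable bit is 1
        self.local_memory = {}           # data address -> payload
        self.rx_directory = {}           # data address -> cores that have not read it yet
        self.direct = {}                 # core -> payload forwarded on arrival

    def on_matched_data(self, addr, payload):
        """Called after the monitoring queue has accepted the data."""
        pending = set(self.receivers)
        ready_receivers = sorted(self.receivers & self.ready)
        if ready_receivers:
            first = ready_receivers[0]  # "any" ready core: the lowest number is chosen
            self.direct[first] = payload
            self.ready.discard(first)
            pending.discard(first)
        # Always store and index the data so the remaining receivers can fetch it later.
        self.local_memory[addr] = payload
        self.rx_directory[addr] = pending

    def read(self, core, addr):
        """A late receiver reads the stored copy through the RX directory."""
        waiting = self.rx_directory.get(addr, set())
        if core not in waiting:
            return None
        waiting.discard(core)
        return self.local_memory[addr]
```

Setting `ready = {1, 2}` reproduces the first case above (core 1 served directly, cores 2 and 3 read from local memory). An empty `ready` set reproduces the case where every core reads from local memory.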

LMSU 23 and LMSU 24 operate in the same way as LMSU 22, so their operation is not described again here.

The operation of the receiving core is described below with the first, second, and third processing cores in the processing core group 12 as an example.

When the first processing core reaches an instruction that needs to acquire data, it first checks whether the RX directory of LMSU 22 contains a data index identifying the first processing core as a receiving core. As described above, when the local memory of LMSU 22 stores received data, index information such as the address of that data is recorded in the RX directory. If the address of received data recorded there matches the address of the data that the first processing core is to receive, the first processing core can read that data from the local memory of LMSU 22. The second and third processing cores perform the same operation when they reach instructions that need to acquire data.

Consider instead the case where one of these cores, say the first processing core, reaches its instruction to acquire data before the data has arrived. The RX directory of LMSU 22 then contains no data index identifying the first processing core as a receiving core, because the sending core has not yet sent the data to LMSU 22. In that case the first processing core configures the monitoring queue of LMSU 22 according to the data index of the data it is to receive. For example, it writes that data index into the monitoring queue. The data index includes the data address of the data to be received and may also include the number of data blocks and the size of each data block. At the same time, the first processing core sets its corresponding first bit in the receive enable chain of LMSU 22 to the preset value to indicate that it is ready to receive. The second and third processing cores behave in the same way, so this is not described again.

Once the first processing core is ready, the sending core sends data as described above:

1. The sending core sends the configured send enable chain and the information about the data to be sent to the broadcast station.
2. The broadcast station forwards this configuration information to each LMSU.
3. Each LMSU configures its monitoring queue and receiving cores and, when done, reports completion to the broadcast station.
4. The broadcast station forwards the configuration complete message to the sending core, which then sends the data.

The data reaches LMSU 22 through the NOC and the broadcast station. LMSU 22 compares the data with the data index in its monitoring queue and, if they match, accepts the data. Because the receive enable chain shows that the first processing core is ready to receive, LMSU 22 sends the data directly to the first processing core. LMSU 22 also checks the receive enable chain for the second and third processing cores. It stores the data in its local memory and writes the index of the stored data into the RX directory, so that the second and third processing cores can read the stored data directly from local memory.

According to the embodiments of the present disclosure, the RX directory, send enable chain, receive enable chain, and monitoring queue in each LMSU can be configured flexibly. The same mechanism therefore supports both single-core-send/single-core-receive and single-core-send/multi-core-receive communication. This can improve the efficiency of inter-core communication and reduce the complexity of the communication network.

An inter-core communication method according to an embodiment of the present disclosure is described below with reference to fig. 2 to 4.

Fig. 2 is a flowchart illustrating an inter-core communication method according to one embodiment of the present disclosure. The inter-core communication method is applicable to the chip described with reference to fig. 1, and the chip may include a plurality of processing core groups, each of which may include at least one processing core and may be connected to one LMSU through the NOC.

Referring to fig. 2, in step S201, one processing core, which is a transmitting core, may access an LMSU corresponding to another processing core, which is a receiving core, through the NOC.

In step S202, the sending core may determine whether the receiving core is ready to receive according to the parameters in the corresponding LMSU.

In step S203, in response to the receiving core being ready for reception, the transmitting core may transmit data to the receiving core through the NOC, the broadcast station, and the corresponding LMSU.

As described above, the broadcast station is connected to a node of the NOC (e.g., NOC20) and to all of the LMSUs.

Alternatively, in step S204, the sending core may send data to the corresponding LMSU through the NOC in response to the receiving core not being ready to receive.
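Steps S201 to S204 amount to a branch on the receiving core's state. The function `fig2_send` below is a hypothetical sketch that returns the step taken and the hops the data passes through. Representing the receive enable chain as a Python list of bits is an assumption.

```python
from typing import List, Sequence, Tuple


def fig2_send(receive_enable: Sequence[int], receiver: int) -> Tuple[str, List[str]]:
    """Sketch of steps S201-S204 from the sending core's point of view.

    receive_enable: receive enable chain of the receiving core's LMSU (1 = ready),
    read by the sending core over the NOC (S201/S202).
    """
    if receive_enable[receiver] == 1:
        # S203: data travels through the NOC and the broadcast station to the LMSU,
        # which hands it to the ready receiving core.
        return "S203", ["NOC", "broadcast station", "LMSU", f"core {receiver}"]
    # S204: receiver not ready, so the data is parked in the LMSU's local memory.
    return "S204", ["NOC", "LMSU local memory"]
```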

The detailed process of this embodiment refers to the description of the foregoing embodiments, and is not repeated herein.

Fig. 3 is a flowchart illustrating an inter-core communication method according to another embodiment of the present disclosure. As described above, the inter-core communication method is applicable to the chip described with reference to fig. 1, and the chip may include a plurality of processing core groups, each of which may include at least one processing core and may be connected to one LMSU through the NOC.

Referring to fig. 3, in step S301, a processing core acting as the sending core sends configuration information to the LMSUs via the NOC and the broadcast station. As described above, the broadcast station is connected to a second-level node of the NOC (e.g., NOC20) and to all LMSUs.

In step S302, in response to receiving the configuration complete message transmitted by the broadcast station, the transmitting core transmits data to all LMSUs through the first level nodes of the NOC (e.g., NOC00, NOC01, NOC02, and NOC03) and the broadcast station.

After step S302, the inter-core communication method may further include the following. A first LMSU determines, according to the configuration information, that the processing core group connected to it contains a plurality of processing cores acting as receiving cores. If the first LMSU determines that at least one of these receiving cores is ready to receive, it sends the data to any one of the ready receiving cores and stores the data. If it determines that none of the receiving cores is ready to receive, it stores the data. Here, the first LMSU may be any one of the LMSUs.

Fig. 4 is a flowchart illustrating an inter-core communication method according to still another embodiment of the present disclosure. As described above, the inter-core communication method is applicable to the chip described with reference to fig. 1, and the chip may include a plurality of processing core groups, each of which may include at least one processing core and may be connected to one LMSU through the NOC.

Referring to fig. 4, in step S401, in response to executing an instruction to receive data, a processing core queries its corresponding LMSU. In step S402, in response to finding a record in the corresponding LMSU indicating that the processing core is a receiving core, the processing core may read the data from the corresponding LMSU.

Optionally, in step S403, in response to finding no record in the corresponding LMSU that indicates the processing core as a receiving core, the processing core configures the corresponding LMSU to generate such a record. The corresponding LMSU uses this record to monitor the data sent by the broadcast station.

According to the embodiment of the present disclosure, after the processing core has configured the corresponding LMSU to generate a record indicating that the processing core is a receiving core, the method proceeds to step S404. In step S404, when the corresponding LMSU detects data sent by the broadcast station, the processing core may acquire that data from the corresponding LMSU.

Referring to fig. 5, the electronic device 500 may include a processor 510 and a memory 520. Processor 510 may include, but is not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Neural Processing Unit (NPU), an Application Processor (AP), a Digital Signal Processor (DSP), a system on a chip (SoC), a microprocessor, and the like. The processor may be implemented by a chip according to an embodiment of the disclosure. The inter-core communication method according to the embodiments of the present disclosure may be implemented in the chip. The memory 520 stores computer programs to be executed by the processor 510. Memory 520 includes high speed random access memory and/or non-volatile computer-readable storage media.

Optionally, the electronic device 500 may further include a display device 525, a storage device 530, an input device 540, an output device 550, and a communication device 560. These elements of the electronic device 500 may communicate with each other via a communication bus. The display device 525 may display various user interfaces and/or application interfaces. The storage device 530 includes computer-readable storage media and stores more information, for a longer period, than the memory 520; for example, it includes storage media such as hard disks, optical disks, and solid-state drives. The input device 540 receives input from a user through tactile, video, audio, or touch input. For example, the input device 540 includes a keyboard, a mouse, a touch screen, a microphone, or any other device that detects input from a user and transmits the detected input to the electronic device 500. The output device 550 provides output of the electronic device 500 to a user through a visual, auditory, or tactile channel. The output device 550 includes, for example, a display, a touch screen, a speaker, a vibration generator, or any other device that provides output to a user. The communication device 560 communicates with an external device through a wired or wireless network.

The inter-core communication method according to an embodiment of the present disclosure may be written as a computer program and stored on a computer-readable storage medium. The computer program, when executed by a processor, may implement the inter-core communication method as described above. Examples of computer-readable storage media include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (xD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program.
In one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.

While the present disclosure has been particularly shown and described with reference to embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure as defined by the following claims.

Claims

1. A chip, wherein the chip comprises:

a plurality of processing core groups, each processing core group comprising at least one processing core, any one of the plurality of processing core groups being connected to others of the plurality of processing core groups via a network on chip NOC;

a plurality of local shared memory units (LMSUs) in one-to-one correspondence with the plurality of processing core groups, the LMSUs being connected to all processing cores in the corresponding processing core groups through a first level node of the NOC;

a broadcast station connected to a second level node of the NOC and the plurality of LMSUs, respectively.

2. The chip of claim 1, wherein the LMSU comprises:

a receiving directory for recording data indexes of the processing cores in the corresponding processing core group as receiving cores;

a sending enable chain, configured to configure each receiving core in the current transmission for the processing core serving as the sending core in the corresponding processing core group;

a receive enable chain, configured to configure a working state of each processing core in the corresponding processing core group as the receive core;

a local memory for storing data; and

a monitoring queue for monitoring a communication port between the LMSU and the broadcast station.

3. The chip of claim 2, wherein the transmit enable chains in a plurality of the LMSUs are concatenated and mapped to the broadcast station.

4. An inter-core communication method for a chip comprising a plurality of groups of processing cores, characterized in that said groups of processing cores comprise at least one processing core, each of said groups of processing cores being connected to a local shared memory unit LMSU via a network on chip NOC;

the inter-core communication method comprises the following steps:

one processing core serving as a sending core accesses, through the NOC, an LMSU corresponding to another processing core serving as a receiving core;

determining whether the receiving core is ready to receive according to the parameters in the corresponding LMSU; and

in response to the receiving core being ready to receive, the transmitting core transmits data to the receiving core through the NOC, a broadcast station and the corresponding LMSU, wherein the broadcast station is connected to the NOC and all of the LMSUs, respectively.

5. The inter-core communication method of claim 4, further comprising:

in response to the receiving core not being ready to receive, the transmitting core transmits data to the corresponding LMSU through the NOC.

6. An inter-core communication method for a chip comprising a plurality of groups of processing cores, characterized in that said groups of processing cores comprise at least one processing core, each of said groups of processing cores being connected to a local shared memory unit LMSU via a network on chip NOC;

the inter-core communication method comprises the following steps:

a processing core serving as a sending core sends configuration information to the LMSUs through the NOC and a broadcast station, wherein the broadcast station is connected to a second-level node of the NOC and to all of the LMSUs;

in response to receiving a configuration complete message sent by the broadcast station, the sending core sends the data to each of the LMSUs through a first level node of the NOC and the broadcast station.

7. The inter-core communication method of claim 6, further comprising:

a first LMSU determines, according to the configuration information, that a plurality of processing cores serving as receiving cores exist in the processing core group connected to the first LMSU;

if the first LMSU determines that at least one receiving core in the plurality of receiving cores is ready for receiving, the first LMSU sends the data to any receiving core which is ready for receiving, and stores the data; or

if the first LMSU determines that none of the receiving cores is ready to receive, the first LMSU saves the data,

wherein the first LMSU is any one of all the LMSUs.

8. An inter-core communication method for a chip comprising a plurality of groups of processing cores, characterized in that said groups of processing cores comprise at least one processing core, each of said groups of processing cores being connected to a local shared memory unit LMSU via a network on chip NOC;

the inter-core communication method comprises the following steps:

in response to executing an instruction to receive data, a first processing core queries the LMSU corresponding to the first processing core; and

in response to finding a record in the LMSU indicating that the first processing core is a receiving core, the first processing core reads data from the LMSU.

9. The inter-core communication method of claim 8, wherein the inter-core communication method further comprises:

in response to not querying a record in the LMSU indicating that the first processing core is a receiving core, the first processing core configures the LMSU to generate a record in the LMSU indicating that the first processing core is ready to receive and a record indicating that the LMSU monitors data sent by the broadcast station.

10. The inter-core communication method of claim 9, wherein after the first processing core configures the LMSU to generate a record in the LMSU indicating that the first processing core is ready to receive, the inter-core communication method further comprises:

in response to the LMSU detecting data sent by the broadcast station, the first processing core acquires the data sent by the broadcast station from the LMSU.