NZ716954B2 - Computing architecture with peripherals - Google Patents
- Publication number
- NZ716954B2 NZ716954A NZ71695414A
- Authority
- NZ
- New Zealand
- Prior art keywords
- interconnect
- memory transfer
- memory
- port
- master
- Prior art date
Links
- 230000002093 peripheral Effects 0.000 title description 78
- 230000004044 response Effects 0.000 abstract description 119
- 230000000875 corresponding Effects 0.000 abstract description 44
- 238000000034 method Methods 0.000 description 37
- 238000010586 diagram Methods 0.000 description 23
- 230000000903 blocking Effects 0.000 description 22
- 238000004458 analytical method Methods 0.000 description 19
- 230000032258 transport Effects 0.000 description 18
- 230000003111 delayed Effects 0.000 description 15
- 230000002457 bidirectional Effects 0.000 description 14
- 238000005192 partition Methods 0.000 description 14
- 230000002708 enhancing Effects 0.000 description 7
- 230000003068 static Effects 0.000 description 7
- 238000004590 computer program Methods 0.000 description 6
- 230000001934 delay Effects 0.000 description 5
- 239000002131 composite material Substances 0.000 description 4
- 230000000694 effects Effects 0.000 description 4
- 230000003287 optical Effects 0.000 description 4
- 238000004140 cleaning Methods 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 238000004377 microelectronic Methods 0.000 description 2
- 230000001960 triggered Effects 0.000 description 2
- 230000006399 behavior Effects 0.000 description 1
- 230000001427 coherent Effects 0.000 description 1
- 230000001276 controlling effect Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 230000000873 masking Effects 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
- 230000002459 sustained Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
- G06F12/0817—Cache consistency protocols using directory methods
- G06F12/082—Associative directories
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
- G06F12/0831—Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0864—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using pseudo-associative means, e.g. set-associative or hashing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/16—Handling requests for interconnection or transfer for access to memory bus
- G06F13/1605—Handling requests for interconnection or transfer for access to memory bus based on arbitration
- G06F13/1652—Handling requests for interconnection or transfer for access to memory bus based on arbitration in a multiprocessor architecture
- G06F13/1663—Access to shared memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/20—Handling requests for interconnection or transfer for access to input/output bus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/36—Handling requests for interconnection or transfer for access to common bus or bus system
- G06F13/362—Handling requests for interconnection or transfer for access to common bus or bus system with centralised access control
- G06F13/364—Handling requests for interconnection or transfer for access to common bus or bus system with centralised access control using independent requests or grants, e.g. using separated request and grant lines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/36—Handling requests for interconnection or transfer for access to common bus or bus system
- G06F13/368—Handling requests for interconnection or transfer for access to common bus or bus system with decentralised access control
- G06F13/372—Handling requests for interconnection or transfer for access to common bus or bus system with decentralised access control using a time-dependent priority, e.g. individually loaded time counters or time slot
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/40—Bus structure
- G06F13/4063—Device-to-bus coupling
- G06F13/4068—Electrical coupling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/42—Bus transfer protocol, e.g. handshake; Synchronisation
- G06F13/4282—Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/28—Using a specific disk cache architecture
- G06F2212/283—Plural cache memories
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/60—Details of cache memory
- G06F2212/6032—Way prediction in set-associative cache
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/62—Details of cache specific to multiprocessor cache arrangements
- G06F2212/621—Coherency control relating to peripheral accessing, e.g. from DMA or I/O device
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0614—Improving the reliability of storage systems
- G06F3/0619—Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/065—Replication mechanisms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C7/00—Arrangements for writing information into, or reading information out from, a digital store
- G11C7/10—Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
- G11C7/1072—Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers for memories with random access ports synchronised on clock signal pulse trains, e.g. synchronous memories, self timed memories
Abstract
A shared memory computing architecture (300) has M interconnect masters (350, 351, 352, 353, 354), one interconnect target (370), and a timeslot based interconnect (319). The interconnect (319) has a unidirectional timeslot based interconnect (320) to transport memory transfer requests with T timeslots and a unidirectional timeslot based interconnect (340) to transport memory transfer responses with R timeslots. For each of the R timeslots, that timeslot corresponds to one memory transfer request timeslot and starts at least L clock cycles after the start time of that corresponding memory request timeslot. The value of L is >= 3 and < T. Interconnect target (370) is connected to interconnect (319). Each interconnect master (350, 351, 352, 353, 354) is connected to interconnect (319).
Description
COMPUTING ARCHITECTURE WITH PERIPHERALS
Field of the invention
The present invention relates to multi interconnect master computing architectures and is
particularly applicable to real-time and mixed-criticality computing involving peripherals.
Background of the invention
Throughout this specification, including the claims:
a bus master is a type of interconnect master;
a bus target / slave is a type of an interconnect target;
a memory store coupled with a memory controller may be described at a higher level of
abstraction as a memory store;
a peripheral may or may not have I/O pins;
a peripheral is connected to an interconnect that transports memory transfer requests;
a peripheral may be memory mapped, such that a memory transfer request to the
interconnect target port of a peripheral is used to control that peripheral;
a processor core may be remotely connected to an interconnect over a bridge; and
a definition and description of domino timing effects can be found in [1].
Many shared memory computing devices with multiple bus-masters / interconnect-masters, such
as the European Space Agency's Next Generation Microprocessor (NGMP) architecture [3] experience
severe real-time problems [4]. For example, the memory transfer requests of software running
on one core of the NGMP architecture experience unwanted timing interference from unrelated
memory transfer requests issued by other bus masters [4] over the shared ARM AMBA AHB [2]
interconnect. For example, unwanted timing interference can occur by memory transfer requests
issued by other cores and bus master peripherals to the level 2 cache module and SDRAM. Even
though most memory transfer requests are in practice at most 32-bytes in length, a single
memory transfer request can block the bus from servicing other memory transfer requests for
more than 10 clock cycles.
Summary of the invention
In contrast, in one aspect, embodiments of the present invention provide a shared memory
computing device comprising:
a first clock;
at least M interconnect masters, where the value of M is 4;
at least 1 interconnect target;
a first timeslot based interconnect for transporting memory transfer requests and their
corresponding responses, comprising:
an input clock port that is connected to the first clock;
a unidirectional timeslot based interconnect to transport memory transfer requests
with T timeslots, where the value of T is at least 4;
a unidirectional timeslot based interconnect to transport memory transfer
responses with R timeslots, in which:
for each of the R timeslots, that timeslot:
corresponds to one memory transfer request timeslot; and
starts at least L clock cycles after the start time of that
corresponding memory request timeslot, where the value of L is at
least 3 and less than the value of T;
in which:
at least one interconnect target is connected to the first timeslot based
interconnect; and
for each interconnect master I of the M interconnect masters:
each interconnect master I is connected to the first ot based
interconnect; and
each of the T ots is mappable to a different one of the M
interconnect masters.
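By way of a non-limiting illustration, the timing relationship between request timeslots and their corresponding response timeslots in this aspect may be sketched as follows. The sketch assumes one-clock-cycle timeslots starting back to back and concrete example values of T and L; the aspect itself only requires T to be at least 4 and L to be at least 3 and less than T.

```python
# Illustrative sketch only: request timeslot i is assumed to start at clock
# cycle i and to last one clock cycle; T and L are example values.
T = 5          # number of request timeslots (at least 4)
L = 3          # offset in clock cycles (at least 3 and less than T)
assert 3 <= L < T

def response_slot_start(request_slot: int) -> int:
    """Start cycle of the response timeslot corresponding to a request timeslot."""
    request_start = request_slot          # one-cycle slots starting back to back
    return request_start + L              # at least L cycles after the request slot starts

for i in range(T):
    print(f"request timeslot {i} starts at cycle {i}; "
          f"corresponding response timeslot starts at cycle {response_slot_start(i)}")
```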
A shared memory computing device optimised for upper-bound worst case execution time
analysis comprising:
an on-chip random access memory store comprising at least two interconnect target ports,
in which:
the first target port:
has a data path of D-bits in width, the value of D being larger than or
equal to 2;
is adapted to sustain a throughput of one D-bit wide memory transfer
request per clock cycle; and
is adapted to sustain a throughput of one D-bit wide memory transfer
response per clock cycle; and
the second target port:
has a data path of E-bits in width, the value of E being larger than or equal
to 1;
is adapted to sustain a throughput of one E-bit wide memory transfer
request per clock cycle; and
is adapted to sustain a throughput of one E-bit wide memory transfer
response per clock cycle;
a first on-chip shared memory interconnect which:
has a data path of D-bits in width;
is exclusively connected to the first port of the at least two interconnect target
ports of the on-chip random access memory;
is adapted to sustain a throughput of one D-bit wide memory transfer request per
clock cycle to the on-chip random access memory;
is adapted to sustain a throughput of one D-bit wide memory transfer response per
clock cycle; and
has at least two cache modules connected to it, each cache module comprising:
a master port with a D-bit wide data path which is connected to this
interconnect; and
a target port;
and a second on-chip shared memory interconnect which:
has a data path of E-bits in width;
is exclusively connected to the second port of the at least two interconnect target
ports of the on-chip random access memory;
is adapted to sustain a peak throughput of one E-bit wide memory transfer request
per clock cycle to the on-chip random access memory;
is adapted to sustain a peak throughput of one E-bit wide memory transfer
response per clock cycle; and
has at least two interconnect masters connected to it.
A shared memory computing device comprising:
a first system interconnect;
an on-chip random access memory store comprising at least one interconnect target port,
in which the first interconnect target port is connected to the first system interconnect;
at least one sub-computing device, each sub-computing device comprising:
a first local interconnect;
a first interconnect master connected to a local interconnect of the sub-computing
device;
an interconnect bridge comprising two ports, in which:
the first port is connected to the first system interconnect; and
the second port is connected to a local interconnect of the sub-computing
device; and
in which the first interconnect master is adapted to issue memory transfer requests
to the on-chip random access memory store; and
a first peripheral, comprising:
a first interconnect target port which is connected to the first local interconnect of
the first of the at least one sub-computing devices;
a first interconnect master port which is adapted to issue memory transfer requests
to the on-chip random access memory store;
in which:
the first interconnect master of the first of the at least one sub-computing devices
is adapted to issue memory transfer requests to the first peripheral.
A shared memory computing device comprising:
M interconnect-masters, where the value of M is at least 2, each interconnect-master
comprising:
an egress port; and
an ingress port; and
a first timeslot based interconnect for transporting memory transfer requests and their
corresponding responses, comprising:
an arbiter and decoder module;
a M-to-1 multiplexer, comprising:
a select port;
M data input ports; and
1 data output port;
and a 1-to-M demultiplexer, comprising:
a select port;
1 data input port; and
M data output ports;
in which:
for each interconnect master I:
the egress port of interconnect master I is connected to the data input port I of the
M-to-1 multiplexer; and
the ingress port of interconnect master I is connected to the data output port I of
the 1-to-M demultiplexer;
the arbiter and decoder module of the interconnect controls the value supplied to the
select port of the M-to-1 multiplexer; and
the value supplied to the select port of the 1-to-M demultiplexer is the value supplied to
the select port of the M-to-1 multiplexer delayed by L clock cycles, where the value of L
is larger or equal to 3.
A shared memory computing device comprising:
M interconnect-nodes, where the value of M is at least 2, each interconnect-node
comprising:
an egress port; and
an ingress port;
a singular interconnect node comprising:
an egress port; and
an ingress port;
a first Mx1 interconnect for transporting memory transfer requests and their
corresponding responses, comprising:
M bidirectional ports, each comprising:
an ingress port which is connected to the egress port of a different one of
the M interconnect-nodes; and
an egress port, which is connected to the ingress port of a different one of
the M interconnect-nodes;
a singular bidirectional port comprising:
an egress port which is connected to the ingress port of the singular
interconnect node; and
an ingress port which is connected to the egress port of the singular
interconnect node;
a parallel-in, serial-out (PISO) M input port x 1 output port shift register with M
stages, in which:
for each stage I of the M stages: that stage is connected to the egress port
of the interconnect node I of M interconnect nodes; and
the output of stage 1 is connected to the egress port of the singular port of
the interconnect;
a serial-in, parallel-out (SIPO) 1 input port x M output port module, in which the
input is connected to the ingress port of the singular port of the interconnect; and
an arbiter and decoder module which is adapted to control the PISO Mx1 shift
register and the SIPO 1xM module.
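A minimal behavioural sketch of the parallel-in, serial-out and serial-in, parallel-out modules of this aspect is given below. It is illustrative only: the word values, the choice of M = 3 and the direct loop-back of the serial stream are assumptions of the sketch, and in the aspect above the serialised stream would instead pass via the singular port.

```python
from collections import deque
from typing import List, Optional

class PisoMx1:
    """Parallel-in, serial-out: load one word from each of the M nodes in
    parallel, then shift them out one per clock cycle on the single output."""
    def __init__(self, m: int):
        self.m = m
        self.stages: deque = deque()

    def load(self, words: List[Optional[str]]) -> None:
        assert len(words) == self.m
        self.stages = deque(words)            # stage I holds the word from node I

    def shift_out(self) -> Optional[str]:
        return self.stages.popleft() if self.stages else None

class Sipo1xM:
    """Serial-in, parallel-out: collect M words arriving one per clock cycle
    and present them in parallel, one to each of the M nodes."""
    def __init__(self, m: int):
        self.m = m
        self.collected: List[Optional[str]] = []

    def shift_in(self, word: Optional[str]) -> Optional[List[Optional[str]]]:
        self.collected.append(word)
        if len(self.collected) == self.m:
            out, self.collected = self.collected, []
            return out                         # parallel output for the M nodes
        return None

# Toy run with M = 3: the serialised stream is simply looped back into the
# SIPO module for illustration.
piso, sipo = PisoMx1(3), Sipo1xM(3)
piso.load(["req-from-node-0", "req-from-node-1", None])   # node 2 idle
for _ in range(3):
    print(sipo.shift_in(piso.shift_out()))
```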
A shared memory computing device optimised for worst case execution time analysis
comprising:
N fully associative cache modules, where the value of N is at least 1, each fully
associative cache module comprising:
a master port:
a target port;
a means to track dirty cache-lines;
a finite state machine with one or more policies, in which at least one policy:
employs an allocate on read strategy;
employs an allocate on write strategy; and
employs a least recently used eviction strategy; and
N processor cores, in which each core is assigned a different one of the N fully
associative cache modules as its private cache.
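A behavioural sketch of one such policy (allocate on read, allocate on write, least recently used eviction, with dirty cache-line tracking) is given below. It is illustrative only: the single-word line granularity, the dictionary standing in for the backing store and all identifiers are assumptions of the sketch rather than features of this aspect.

```python
from collections import OrderedDict

class FullyAssociativeCache:
    """Illustrative fully associative cache: allocate on read, allocate on
    write, least-recently-used eviction, per-line dirty tracking."""
    def __init__(self, num_lines: int, backing: dict):
        self.num_lines = num_lines
        self.backing = backing                  # stands in for the master port / memory store
        self.lines = OrderedDict()              # addr -> (data, dirty); order records recency

    def _allocate(self, addr, data, dirty):
        if len(self.lines) >= self.num_lines:
            victim, (vdata, vdirty) = self.lines.popitem(last=False)   # least recently used
            if vdirty:
                self.backing[victim] = vdata    # write back a dirty victim line
        self.lines[addr] = (data, dirty)

    def read(self, addr):
        if addr in self.lines:                  # hit
            data, dirty = self.lines[addr]
            self.lines.move_to_end(addr)        # mark most recently used
            return data
        data = self.backing.get(addr, 0)        # miss: allocate on read
        self._allocate(addr, data, dirty=False)
        return data

    def write(self, addr, data):
        if addr in self.lines:                  # hit: update in place and mark dirty
            self.lines[addr] = (data, True)
            self.lines.move_to_end(addr)
        else:
            self._allocate(addr, data, dirty=True)   # miss: allocate on write

memory = {}
cache = FullyAssociativeCache(num_lines=2, backing=memory)
cache.write(0x100, 42)          # allocated on write, marked dirty
cache.read(0x200)               # allocated on read
cache.read(0x300)               # evicts 0x100 (the LRU line) and writes it back
print(memory)                   # {256: 42}
```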
A shared memory computing device optimised for worst case execution time analysis
comprising:
at least one interconnect master;
N cache modules, where the value of N is at least 1, each cache module comprising:
a master port:
a target port; and
a finite state machine that employs an update-type cache coherency policy;
N processor cores, in which each core:
is assigned a different one of the N fully associative cache modules as its private
cache; and
in which the execution time of memory transfer requests issued by each of the N
processor cores is not modified by:
the unrelated memory transfer requests issued by any of the other N processor
cores; or
the unrelated memory transfer requests issued by at least one other interconnect master.
A bidirectional interconnect for transporting memory transfer requests and their corresponding
memory transfer responses, comprising:
a unidirectional interconnect to transport memory transfer requests; and
a unidirectional interconnect to transport memory transfer responses, adapted to transport
memory transfer responses that include a copy of the corresponding memory transfer
request.
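By way of illustration only, a response that carries a full copy of its corresponding request can be modelled with two record types as follows; the field names and values are assumptions of the sketch. An observer that only sees the response interconnect can then recover the originating master, the transfer direction and the address from the response alone, which is the property the snooping arrangement described later relies on.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryTransferRequest:
    master_id: int          # which interconnect master issued the request
    write: bool             # True for a write, False for a read
    address: int
    data: int = 0           # payload for writes

@dataclass(frozen=True)
class MemoryTransferResponse:
    request: MemoryTransferRequest   # full copy of the corresponding request
    data: int = 0                    # payload for read responses
    ok: bool = True

req = MemoryTransferRequest(master_id=2, write=False, address=0x1000)
rsp = MemoryTransferResponse(request=req, data=0xCAFE)

# A module that only sees the response still learns everything about the request:
print(rsp.request.master_id, rsp.request.write, hex(rsp.request.address), hex(rsp.data))
```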
Further inventive aspects of the present invention are set out in the claims appearing at the end of
this specification.
Brief description of the drawings
For a better understanding of the invention, and to show how it may be carried into effect,
embodiments of it are shown, by way of non-limiting example only, in the accompanying
drawings. In the drawings:
figure 1 is a block schematic diagram illustrating preferred embodiments of the present
invention;
figure 2 is a flow-chart illustrating processes according to the embodiments of figure 1;
figure 3 is a block schematic diagram illustrating preferred embodiments of the present invention;
figure 4 is a flow-chart illustrating processes according to the embodiments of figure 3;
figure 5 is a timing diagram illustrating timing according to the embodiments of figure;
figure 6 is a block schematic diagram illustrating preferred embodiments of the present
invention;
figures 7 and 8 are timeslot scheduling diagrams according to embodiments of the type of
figure 3;
figure 9 is an access control list diagram according to embodiments of the type of figure
3;
figure 10 is a hybrid block schematic diagram illustrating the allocation of memory, and
the timing of interconnect masters accessing that memory according to embodiments of the
type of figure 3 and figure 6;
figure 11 is a block schematic diagram illustrating portions of the embodiments of figures
1 and 3;
figure 12 is a block schematic diagram illustrating preferred embodiments of the present
invention;
figure 13 is a flow-chart illustrating processes according to the embodiments of figure 12;
figure 14 is a block schematic diagram illustrating portions of the embodiments of figures
3 and 12;
figure 15 is a high-level block schematic diagram illustrating a preferred embodiment of
the present invention;
figures 16 to 19 are flow-charts illustrating processes according to the embodiments of
figure 15; and
figure 20 is a diagram illustrating two sets of fields according to preferred embodiments
of the present invention.
Description of preferred embodiments of the invention
Figure 1 is a block schematic diagram illustrating portions of a shared memory computing
architecture (300) for preferred embodiments of the present invention. Shared memory
computing architecture (300) comprises 5 unidirectional interconnect bridges (350, 351, 352,
353, 354). Each unidirectional interconnect bridge (350, 351, 352, 353, 354) comprises:
an interconnect target port ({350.ti, 350.te}, {351.ti, 351.te}, {352.ti, 352.te}, {353.ti,
353.te}, {354.ti, 354.te}) comprising:
an ingress port (350.ti, 351.ti, 352.ti, 353.ti, 354.ti); and
an egress port (350.te, 351.te, 352.te, 353.te, 354.te);
an interconnect master port ({350.mi, 350.me}, {351.mi, 351.me}, {352.mi, 352.me},
{353.mi, 353.me}, {354.mi, 354.me}) comprising:
an ingress port (350.mi, 351.mi, 352.mi, 353.mi, 354.mi); and
an egress port (350.me, 351.me, 352.me, 353.me, 354.me);
a memory transfer request module (330, 332, 334, 336, 338) comprising:
an ingress port (350.ti, 351.ti, 352.ti, 353.ti, 354.ti);
an egress port (350.me, 351.me, 352.me, 353.me, 354.me);
a memory transfer response module (331, 333, 335, 337, 339) comprising:
an ingress port (350.ti, 351.ti, 352.ti, 353.ti, 354.ti); and
an egress port (350.me, 351.me, 352.me, 353.me, 354.me).
The shared memory computing architecture (300) further comprises:
M interconnect masters (350, 351, 352, 353, 354), where the value of M is 5, in which
each interconnect master comprises:
an egress port (350.me, 351.me, 352.me, 353.me, 354.me); and
an ingress port (350.mi, 351.mi, 352.mi, 353.mi, 354.mi); and
a first timeslot based interconnect (319) for transporting memory transfer requests and
their corresponding responses, comprising:
an arbiter and decoder module (360);
a M-to-1 multiplexer (321), comprising:
a select port;
M data input ports (320.a, 320.b, 320.c, 320.d, 320.e); and
1 data output port (320.f);
and a 1-to-M demultiplexer (341), comprising:
a select port;
1 data input port (340.f); and
M data output ports (340.a, 340.b, 340.c, 340.d, 340.e);
in which:
for each interconnect master I:
the egress port of interconnect master I is connected to the data input port I of the
M-to-1 multiplexer ({350.me, 320.a}, {351.me, 320.b}, {352.me, 320.c},
{353.me, 320.d}, {354.me, 320.e}); and
the ingress port of interconnect master I is connected to the data output port I of
the 1-to-M demultiplexer ({350.mi, 340.a}, {351.mi, 340.b}, {352.mi, 340.c},
{353.mi, 340.d}, {354.mi, 340.e});
the arbiter and decoder module (360) of the interconnect (319) controls the value
supplied on wire (361) to the select port of the M-to-1 multiplexer (321); and
the value supplied (on wire 342) to the select port of the 1-to-M demultiplexer (341) is
the value supplied to the select port of the M-to-1 multiplexer delayed by the first in first
out module (329) for L clock cycles, where the value of L is larger or equal to 3.
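A cycle-based behavioural sketch of this arrangement is given below. It is illustrative only, with M = 5 and L = 3 as in figure 1, the interconnect target modelled as a fixed three-stage pipeline, and all identifiers being assumptions of the sketch rather than limitations of the embodiment.

```python
from collections import deque

M, L = 5, 3   # five interconnect masters and a three-cycle request-to-response offset

class DelayedTarget:
    """Stands in for interconnect target (370): a three-stage pipeline, so a
    request forwarded in cycle t produces its response in cycle t + 3."""
    def __init__(self, depth=L):
        self.pipe = deque([None] * depth)
    def cycle(self, request):
        self.pipe.append(None if request is None else f"response({request})")
        return self.pipe.popleft()

class TimeslotInterconnect:
    """The arbiter's select value steers the M-to-1 request multiplexer in the
    current cycle and, after an L-deep FIFO, steers the 1-to-M response
    demultiplexer L cycles later, mirroring FIFO (329) in figure 1."""
    def __init__(self, target):
        self.target = target
        self.select_fifo = deque([None] * L)

    def cycle(self, granted_master, requests, responses):
        forwarded = None if granted_master is None else requests[granted_master]
        response = self.target.cycle(forwarded)      # request path through the multiplexer
        self.select_fifo.append(granted_master)
        steer = self.select_fifo.popleft()           # delayed select for the demultiplexer
        if steer is not None and response is not None:
            responses[steer] = response              # response returned to the right master

interconnect = TimeslotInterconnect(DelayedTarget())
responses = [None] * M
for cycle, grant in enumerate([0, 3, None, 1, 2, None, None, None]):
    pending = [f"master{i}-request@cycle{cycle}" for i in range(M)]
    interconnect.cycle(grant, pending, responses)
print(responses)
```

In this toy run each granted master receives the response to the request it issued three cycles earlier, while the ungranted master (here master 4) receives nothing.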
The interconnect arbiter and decoder module (360) receives as inputs the control signals, e.g. on
wire (362), generated by the 5 interconnect masters (350, 351, 352, 353, 354) that are received
on ports (320.a, 320.b, 320.c, 320.d, 320.e) respectively and the control signals on wire (363)
generated by the 1 interconnect target (370) and received on port (340.f). Preferably the
scheduling scheme of the interconnect arbiter and decoder module (360) is adapted to consider
the state of those control signals (such as the values received on wires (362) and (363)).
The interconnect arbiter and decoder module (360) generates one or more control signals
released as output on ports (340.a, 340.b, 340.c, 340.d, 340.e) that are supplied to the 5
interconnect master's ingress ports (350.mi, 351.mi, 352.mi, 353.mi, 354.mi). The interconnect
arbiter and decoder module (360) also generates one or more control signals as outputs (not
illustrated) which are supplied over port (320.f) to the interconnect target's (370) ingress port.
Preferably the arbiter and decoder module (360) of the first timeslot based interconnect (319)
employs at least one scheduling scheme selected from the group comprising:
a least recently granted interconnect master scheme (see figure 8);
a least recently granted interconnect master scheme with rate throttling on at least one
interconnect master (see figure 8);
a static timeslot scheme (see figure 5);
a dynamic timeslot scheme (see figure 2); and
a time triggered protocol scheme (see figure 7).
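A behavioural sketch of the first two schemes in the list above (least recently granted, optionally with rate throttling) is given below. The parameter names, the throttling rule of at most one grant every min_gap timeslots, and the request pattern are assumptions of the sketch, not features required by the embodiment.

```python
class LeastRecentlyGrantedArbiter:
    """Grant each timeslot to the requesting master that was granted longest
    ago; masters listed in min_gap may win at most once every min_gap slots."""
    def __init__(self, num_masters, min_gap=None):
        self.last_grant = [-1] * num_masters     # -1 means never granted
        self.min_gap = min_gap or {}             # master index -> minimum slots between grants
        self.slot = 0

    def arbitrate(self, requesting):
        eligible = [m for m in sorted(requesting)
                    if self.slot - self.last_grant[m] >= self.min_gap.get(m, 0)]
        winner = min(eligible, key=lambda m: self.last_grant[m], default=None)
        if winner is not None:
            self.last_grant[winner] = self.slot
        self.slot += 1
        return winner

arbiter = LeastRecentlyGrantedArbiter(5, min_gap={4: 4})   # rate throttle master 4
for slot in range(8):
    print(slot, arbiter.arbitrate({0, 1, 4}))    # masters 0, 1 and 4 request every slot
```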
Preferably the shared memory computing architecture (300) is adapted such that:
the arbiter of the first timeslot based interconnect (319) is adapted to:
grant a first timeslot to one of the M interconnect masters (350, 351, 352, 353,
354);
not grant the next timeslot to that interconnect master; and
grant one of the later timeslots to that interconnect master;
the first interconnect master is adapted to:
issue a memory transfer request to a first interconnect target during the first
timeslot; and
the first interconnect target is adapted to:
transmit at least part of its response to the first interconnect master during the
later timeslot granted to the first interconnect master.
Preferably at least one interconnect target (370) can receive two or more outstanding memory
transfer requests before releasing a memory transfer response related to the first memory transfer
request. Preferably at least one interconnect master (350, 351, 352, 353, 354) is adapted to be
able to issue two or more outstanding memory transfer requests to that interconnect target (370)
before receiving the memory transfer response corresponding to the first memory transfer
request to that interconnect target. For example, a processor core may be adapted to concurrently
issue a first memory transfer request to retrieve executable code and a second memory transfer
request to access data.
Preferably the duration of at least one timeslot of the interconnect (319) is 1 clock cycle in length.
For example, a first timeslot is 1 clock cycle in length, and the second timeslot is 1 clock cycle in
length. In an alternate preferred embodiment of the present invention, each timeslot of the
interconnect (319) has a variable duration of length that is upper-bound for that timeslot. For
example, the duration of the first timeslot is 1 clock cycle and the duration of the second
timeslot ranges from 1 to 2 clock cycles in length.
For the remainder of the text describing figure 1, each timeslot of interconnect (319) has a
duration of 1 clock cycle in length, the FIFO module (329) releases the value of each input as
output 3 clock cycles later, and the sub modules (371), (373) and (372) of module (370) each
take 1 clock cycle to process their inputs and generate a corresponding output.
The shared memory computing architecture (300) further comprises an additional 5 interconnect
masters (310, 311, 312, 313, 314), each comprising an egress port (310.e, 311.e, 312.e, 313.e,
314.e) and an ingress port (310.i, 311.i, 312.i, 313.i, 314.i). The additional 5 interconnect
masters (310, 311, 312, 313, 314) are connected to the interconnect target ports of the 5
interconnect bridges (350, 351, 352, 353, 354) respectively.
The interconnect target (370) is an on-chip shared memory comprising one interconnect target
port, in which that target port:
is adapted to sustain a peak throughput of one memory transfer request per clock cycle;
is adapted to sustain a peak throughput of one memory transfer response per clock cycle.
Preferably at least one memory transfer request can be buffered by one or more of the M
unidirectional interconnect bridges. Preferably at least one of the M unidirectional interconnect
bridges is adapted to support read pre-fetching and write combining.
In some preferred embodiments, one or more of the M unidirectional interconnect bridges (350,
351, 352, 353, 354) are interconnect protocol transcoding bridges in which the protocol to
transcode is a bus interconnect protocol such as ARM AMBA AHB [2].
In some preferred embodiments, at least two of the M unidirectional interconnect bridges (350,
351, 352, 353, 354) are cache modules, in which each of those cache modules is adapted to
complete at least one memory transfer request from a cache-line stored in its cache-line store
without waiting for that cache module’s time-slot on the timeslot based interconnect (319). In
this way, each cache module has the capability to complete memory transfer requests at a rate
faster than the worst-case rate that timeslots are granted to that cache module on the timeslot
based interconnect (319).
In some cases the data-path width of the 5 interconnect masters (310, 311, 312, 313, 314) will be
less than the data-path width of the 5 cache modules' interconnect master ports ({350.mi,
350.me}, {351.mi, 351.me}, {352.mi, 352.me}, {353.mi, 353.me}, {354.mi, 354.me}). For
example, as illustrated in the block diagram 300 of figure 1, the data-path width of the 5
interconnect masters (310, 311, 312, 313, 314) is 32-bits (301), the data-path width of the
timeslot based interconnect (319) is 512-bits (302), and the data-path width of the on-chip
memory store (370) is 512-bits (302).
The use of N cache modules (350, 351, 352, 353, 354) connected to the same timeslot based
interconnect (319) is highly desirable when performing upper-bound worst case execution time
analysis of one or more tasks running in an N processor core (310, 311, 312, 313, 314)
architecture. Benefits include improved decoupling of the execution time of N concurrently
outstanding memory transfer requests issued by N different cores (310, 311, 312, 313, 314), and
to mask some of the access time latencies of memory transfer requests addressed to the shared
on-chip memory (370) over that timeslot based interconnect (319). Preferably each of those N
cache modules (350, 351, 352, 353, 354) has a means for maintaining cache coherency with the
N-1 other cache modules (350, 351, 352, 353, 354) with zero unwanted timing interference
incurred against the memory transfer requests received on that cache's interconnect target port.
Figure 1 also illustrates embodiments of the invention in which a shared memory computing
architecture (300) comprises:
a first clock (not illustrated);
M interconnect masters (350, 351, 352, 353, 354), where the value of M is 5;
1 interconnect target (370);
a first timeslot based interconnect (319) for transporting memory transfer requests and
their corresponding responses, comprising:
an input clock port (318) that is connected to the first clock;
a unidirectional timeslot based interconnect (320) to transport memory transfer
requests with T timeslots, where the value of T is 5;
a unidirectional timeslot based interconnect (340) to transport memory transfer
responses with R timeslots, where the value of R is 5, in which:
for each of the R timeslots, that timeslot:
corresponds to one memory transfer request timeslot; and
starts at least L clock cycles after the start time of that
corresponding memory request timeslot, where the value of L is 3;
in which:
interconnect target (370) is connected to the first timeslot based interconnect
(319);
for each interconnect master I of the M interconnect masters (350, 351, 352, 353,
354):
each interconnect master I is connected to the first timeslot based
interconnect (319); and
each of the T timeslots is mappable to a different one of the M
interconnect masters.
The shared memory computing architecture (300) further comprises an on-chip random access
memory store (370), comprising:
an input clock port that is connected to the first clock (not illustrated); and
at least one interconnect target port which is connected to the first timeslot based
interconnect (319), and in which:
each memory transfer request takes at most K clock cycles to complete under
fault-free operation, where the value of K is 3; and
that target port can sustain a throughput of 1 memory transfer request per clock
cycle.
In a preferred embodiment of the present invention the interconnect target (370) comprises:
a first delay buffer (371) to delay memory transfer requests;
an inner interconnect target (373);
a second delay buffer (372) to delay memory transfer responses;
in which:
the input of the interconnect target (370) is supplied as input to the first delay
buffer (371);
the output of the first delay buffer (371) is supplied as input to the module (373);
the output of the module (373) is supplied as input to the second delay buffer
(372); and
the output of the second delay buffer (372) is supplied as the output of the
interconnect target (370).
In this way, it is possible to transform any interconnect target into an interconnect target that
delays its memory transfer requests and memory transfer responses. The same type of approach
can be adapted to transform any interconnect master into an interconnect master that delays its
memory transfer requests to the interconnect and delays their corresponding responses received
from that interconnect.
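A behavioural sketch of this wrapping approach is given below. It is illustrative only: the inner target is modelled as answering in the same cycle, so the wrapper adds exactly the two buffer delays, and all names are assumptions of the sketch.

```python
from collections import deque

class DelayedTargetWrapper:
    """Wrap an existing interconnect target with an input delay buffer for
    requests and an output delay buffer for responses, in the manner of
    modules (371) and (372) around the inner target (373)."""
    def __init__(self, inner, request_delay=1, response_delay=1):
        self.inner = inner                              # inner target, called once per cycle
        self.request_pipe = deque([None] * request_delay)
        self.response_pipe = deque([None] * response_delay)

    def cycle(self, request):
        self.request_pipe.append(request)
        delayed_request = self.request_pipe.popleft()   # request emerges after request_delay
        response = self.inner(delayed_request)          # inner target answers in this cycle
        self.response_pipe.append(response)
        return self.response_pipe.popleft()             # response emerges after response_delay

# The inner target here is a small dictionary answering combinationally, so the
# wrapper makes the response visible two cycles after the request is presented.
sram = {0x40: 0x1234}
wrapped = DelayedTargetWrapper(lambda r: None if r is None else sram.get(r, 0))
for cycle, request in enumerate([0x40, None, None, None]):
    print(cycle, wrapped.cycle(request))
```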
Figure 2 is a flow-chart illustrating the steps in a memory transfer request process (400) from an
interconnect master (310) to a memory store (370) of figure 1 according to preferred
embodiments of the present invention. In figure 2 the value of L is 3. Each of the interconnect
bridges (350) to (354) is adapted to:
buffer a single contiguous region of memory that is 512-bits wide;
perform 512-bit wide read and 512-bit wide write operations over its master port to the
interconnect (319);
support write combining of 32-bit write memory transfer requests received over its target
port to its 512-bit wide buffer; and
support 32-bit wide read memory transfer requests received over its target port to the
contents of that 512-bit wide buffer.
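A behavioural sketch of such a bridge is given below. It is illustrative only: the little-endian packing, the callback functions standing in for the 512-bit wide master port, and the write-back-on-refill policy are assumptions of the sketch rather than features of the embodiments.

```python
class BridgeLineBuffer:
    """Bridge with a single 512-bit (64-byte) buffer over one aligned region:
    32-bit reads are served from the buffer and 32-bit writes are combined
    into it before a single 512-bit write-back over the master port."""
    LINE_BYTES = 64                               # 512 bits

    def __init__(self, read_wide, write_wide):
        self.read_wide = read_wide                # fetches 64 aligned bytes (stand-in for the wide port)
        self.write_wide = write_wide              # writes 64 aligned bytes back
        self.base = None
        self.buffer = bytearray(self.LINE_BYTES)
        self.dirty = False

    def _fill(self, address):
        base = address & ~(self.LINE_BYTES - 1)
        if base != self.base:
            if self.dirty:
                self.write_wide(self.base, bytes(self.buffer))   # flush combined writes
            self.base, self.dirty = base, False
            self.buffer[:] = self.read_wide(base)

    def read32(self, address):
        self._fill(address)
        offset = address - self.base
        return int.from_bytes(self.buffer[offset:offset + 4], "little")

    def write32(self, address, value):
        self._fill(address)
        offset = address - self.base
        self.buffer[offset:offset + 4] = value.to_bytes(4, "little")  # combine into the buffer
        self.dirty = True

wide_memory = {0: bytes(64)}
bridge = BridgeLineBuffer(lambda base: wide_memory.get(base, bytes(64)),
                          lambda base, data: wide_memory.__setitem__(base, data))
bridge.write32(0x8, 0xDEADBEEF)                   # combined into the 512-bit buffer
print(hex(bridge.read32(0x8)))                    # served from the buffer: 0xdeadbeef
```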
In step 410, start the interconnect master (310) read memory transfer request process.
In step 411, the interconnect master (310) issues a read memory transfer request of 32-bits over
the egress port (310.e) to the target port {350.ti, 350.te} of the interconnect bridge (350).
In step 412, the interconnect master (310) waits for and receives the memory transfer response
from the interconnect bridge (350) on the ingress port (310.i). This completes the 32-bit read
memory transfer request issued in step 411.
In step 413, end the interconnect master (310) read memory transfer request process.
In step 420, start the interconnect bridge (350) memory transfer relay process.
In step 421, the interconnect bridge (350) receives the 32-bit read memory transfer request issued
in step 411 on its interconnect target port {350.ti, 350.te}.
In step 422, the interconnect bridge (350) requests a timeslot on the timeslot based interconnect
over its interconnect master port {350.mi, 350.me}. This interconnect request signal is
transported over wire (362) and received by the interconnect arbiter (360).
In step 423, the interconnect bridge (350) waits one or more clock cycles until it is granted a
timeslot on the timeslot based interconnect (319).
In step 424, the interconnect bridge (350) is allotted an upper-bound duration of time within the
timeslot to issue its memory transfer request and any associated data. The interconnect bridge
(350) issues a 512-bit read memory transfer request over its interconnect master port to the
timeslot based interconnect (319).
In step 425, the interconnect bridge (350) waits for the memory transfer request to be processed.
In this particular example, the interconnect bridge (350) does not issue any additional memory
transfer requests onto the timeslot based interconnect (319) while waiting for the currently
outstanding memory transfer request to be processed.
In step 426, the interconnect bridge (350) is notified by the timeslot based interconnect (319)
when the 512-bit wide read memory transfer request response is available. The interconnect
bridge is allotted an upper-bound duration of time within the timeslot to receive the response of that memory
transfer request. The interconnect bridge (350) receives the response to its memory transfer
request and buffers it.
In step 427, the interconnect bridge relays the requested 32-bits of data from the 512-bit read
memory transfer response over its interconnect target port back to the interconnect master (310).
In step 428, end the interconnect bridge (350) memory transfer relay process.
In step 430, start the timeslot based interconnect (319) memory transfer request cycle.
In step 431, the timeslot based interconnect arbiter and decoder module (360) receives the value
on each interconnect request signal of the 5 interconnect bridges (350, 351, 352, 353, 354)
connected to the timeslot based interconnect (319).
In step 432, the timeslot based interconnect arbiter and decoder module (360) evaluates the
received value from each interconnect request signal according to the policy, configuration and
execution history of the currently active arbitration scheme. For example, if the timeslot based
interconnect arbiter is currently employing a least recently granted interconnect master scheme,
then the least recently granted interconnect master is selected from the set of interconnect
masters currently requesting a timeslot on the interconnect (see figure 8). Alternatively, if the
timeslot based interconnect arbiter and decoder module (360) is currently using a cyclic timeslot
scheduling scheme, then the value on the interconnect request signals does not influence the
scheduling of timeslots.
In step 433, the timeslot based interconnect arbiter and decoder module (360) is illustrated as
having selected the interconnect bridge (350) for the next timeslot. The timeslot based
interconnect arbiter and decoder module (360) signals the interconnect bridge (350) that it has been
granted the next timeslot on the interconnect (319). In the next clock cycle, the timeslot based
interconnect arbiter adjusts the value of the index to the multiplexer (321) to select the data-path
of port (320.a).
In step 434, a copy of the read memory transfer request and associated data is transmitted over
the interconnect master port of the interconnect bridge (350) and is received on the data-path of
port (320.a).
In step 435, a copy of the read memory transfer request received by the timeslot based
interconnect (319) is forwarded to the memory store (370) which is connected to the interconnect
target port (320.f) of the timeslot based interconnect (319). For example, the multiplexer (321)
forwards the selected information received on its data-path to the target port (320.f).
In step 436, the value supplied to the select input of the multiplexer (321) is delayed (329) for L
clock cycles.
In step 437, the value received on the data-path of the target port (340.f) is supplied as input to
the data input port of the iplexer (341). The select port of the demultiplexer receives the
value supplied to the select port of the multiplexer (321) L clock cycles earlier.
In step 438, the value received on target port (340.f) is forwarded to the interconnect bridge
(350) and received in step 426.
In step 439, end the timeslot based interconnect (319) memory transfer request cycle.
In step 440, start the memory store (370) memory transfer request cycle.
In step 441, memory store (370) receives a 512-bit wide read memory transfer request and delays
it in the buffer (371) for 1 clock cycle.
In step 442, the memory store (370) processes the read memory transfer request (373) in 1 clock
cycle and delays the memory transfer response output for 1 clock cycle in the buffer (372).
In step 443, the memory store (370) transmits the read memory transfer request response.
In step 445, end the memory store (370) memory transfer request cycle.
In a preferred embodiment of the present invention, a snooping cache module (354) snoops every
memory transfer response released as output by the de-multiplexer (341) over wire (343).
Preferably each memory transfer response incorporates a copy of its corresponding memory
transfer request.
In a preferred embodiment of the present invention, each of the 5 interconnect master ports of the
interconnect (319) is connected to a different memory management unit (MMU) (380, 381,
382, 383, 384) respectively. In this way, the 5 MMU (380, 381, 382, 383, 384) provide a means
to enforce an access control policy between interconnect masters and the interconnect target
from within the interconnect (319).
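By way of illustration, the access control enforced by each MMU can be sketched as a per-master range check as follows; the address ranges, the boolean deny result and all identifiers are assumptions of the sketch rather than a description of the MMUs (380) to (384).

```python
class MasterPortMmu:
    """One checker per interconnect master port: a memory transfer request is
    allowed only if it falls inside a range granted to that master."""
    def __init__(self, allowed_ranges):
        # allowed_ranges: list of (start, end, writable) tuples for this master
        self.allowed_ranges = allowed_ranges

    def check(self, address, is_write):
        for start, end, writable in self.allowed_ranges:
            if start <= address < end and (writable or not is_write):
                return True
        return False        # a real design might instead return an error response

mmu_for_master_0 = MasterPortMmu([(0x0000, 0x4000, False),    # read-only region
                                  (0x8000, 0x9000, True)])    # read/write region
print(mmu_for_master_0.check(0x0100, is_write=False))   # True: read inside the read-only region
print(mmu_for_master_0.check(0x0100, is_write=True))    # False: writes to that region are denied
```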
In an alternate preferred embodiment of the present invention, interconnect node (370) is an
interconnect master, and interconnect nodes (350) to (354) are protocol transcoding bridges,
interconnect nodes (310) to (314) are interconnect targets, and modules (380) to (384) are not
used.
Figure 3 is a block schematic diagram illustrating portions of a shared memory computing
architecture (500) according to preferred embodiments of the present invention. The shared
memory computing ecture (500) comprises:
M interconnect masters (540, 541, 542, 543, 544), where the value of M is 5, in which
each interconnect master comprises:
an egress port (540.me, 541.me, 542.me, 543.me, 544.me); and
an ingress port (540.mi, 541.mi, 542.mi, 543.mi, 544.mi); and
a first timeslot based interconnect (501) for transporting memory transfer requests and
their corresponding responses, comprising:
an arbiter and decoder module (510);
a M-to-1 multiplexer (521), comprising:
a select port;
M data input ports (520.a, 520.b, 520.c, 520.d, 520.e); and
1 data output port;
and a 1-to-M demultiplexer (531), comprising:
a select port;
1 data input port; and
M data output ports (531.a, 531.b, 531.c, 531.d, 531.e);
in which:
for each interconnect master I:
the egress port of interconnect master I is connected to the data input port
I of the M-to-1 multiplexer ({540.me, 520.a}, {541.me, 520.b}, {542.me,
520.c}, {543.me, 520.d}, {544.me, 520.e}); and
the ingress port of interconnect master I is connected to the data output
port I of the 1-to-M demultiplexer ({540.mi, 531.a}, {541.mi, 531.b},
{542.mi, 531.c}, {543.mi, 531.d}, {544.mi, 531.e});
the arbiter and decoder module (510) of the interconnect (501) controls the value
supplied on wire (511) to the select port of the M-to-1 multiplexer (521); and
the value supplied on wire (513) to the select port of the 1-to-M demultiplexer
(531) is the value supplied to the select port of the M-to-1 lexer delayed by
the first in first out module (515) by L clock cycles, where the value of L is 3.
The shared memory computing architecture (500) further comprises:
S interconnect targets (560, 561, 562, 563, 564), where the value of S is 5, each
interconnect target comprising:
an egress port (560.e, 561.e, 562.e, 563.e, 564.e); and
an ingress port (560.i, 561.i, 562.i, 563.i, 564.i);
in which the first timeslot based interconnect for transporting memory transfer requests
and their corresponding responses further comprises:
a 1-to-S demultiplexer (522), comprising:
a select port;
1 data input port; and
S data output ports (520.f, 520.g, 520.h, 520.i, 520.j); and
and a S-to-1 multiplexer (532), comprising:
a select port;
S data input ports (530.f, 530.g, 530.h, 530.i, 530.j); and
1 data output port;
in which:
the data input port of the 1-to-S demultiplexer (522) receives as input the output
of the M-to-1 multiplexer (521);
the data input port of the 1-to-M demultiplexer (531) receives as input the output
of the S-to-1 multiplexer (532);
for each interconnect target J:
the ingress port of interconnect target J is connected to the data output port
J of the 1-to-S demultiplexer ({560.i, 520.f}, {561.i, 520.g}, {562.i,
520.h}, {563.i, 520.i}, {564.i, 520.j}); and
the egress port of interconnect target J is connected to the data input port J
of the S-to-1 multiplexer ({560.e, 530.f}, {561.e, 530.g}, {562.e, 530.h},
{563.e, 530.i}, {564.e, 530.j}); and
the arbiter and decoder module (510) of the interconnect controls the value
supplied on wire (512) to the select port of the 1-to-S demultiplexer (522); and
the value supplied on wire (514) to the select port of the S-to-1 multiplexer is the
value supplied to the select port of the 1-to-S demultiplexer (522) delayed by the
first in first out module (516) by L clock cycles.
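The same delayed-select technique is thus applied on both sides of the interconnect (501). A compact behavioural sketch of the two delay lines is given below; it is illustrative only, with L = 3 and the cycle numbering borrowed from the figure 4 walkthrough later in this description.

```python
from collections import deque

L = 3   # request-to-response offset used in the figure 3 embodiment

class SelectDelay:
    """An L-deep delay line for a select value: FIFO (515) re-uses the
    master-select to steer demultiplexer (531), and FIFO (516) re-uses the
    target-select to steer multiplexer (532), L cycles after the request."""
    def __init__(self):
        self.fifo = deque([None] * L)
    def step(self, select_now):
        self.fifo.append(select_now)
        return self.fifo.popleft()       # value to apply on the response path

master_side, target_side = SelectDelay(), SelectDelay()

# One grant per cycle: in cycle 3 master (540) is granted and its request is
# decoded as addressing peripheral (562); in cycle 4 master (541) addresses (563).
grants = [(None, None), (None, None), ("540", "562"), ("541", "563"), (None, None), (None, None)]
for cycle, (granted_master, decoded_target) in enumerate(grants, start=1):
    response_to = master_side.step(granted_master)    # drives demultiplexer (531)
    response_from = target_side.step(decoded_target)  # drives multiplexer (532)
    print(f"cycle {cycle}: response path {response_from} -> {response_to}")
```

In cycle 6 the sketch routes the response from peripheral (562) back to master (540), matching the figure 4 walkthrough in which peripheral (562) generates its response in clock cycle 6.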
In figure 3, the data-path width of the interconnect 501 is 32-bits (599).
The interconnect arbiter and decoder module (510) receives as inputs the control signals (not
illustrated) generated by the 5 interconnect masters (540, 541, 542, 543, 544) that are received on
ports (520.a, 520.b, 520.c, 520.d, 520.e) respectively and the control signals (not illustrated)
generated by the 5 interconnect targets (560, 561, 562, 563, 564) and received on ports (530.f,
530.g, 530.h, 530.i, 530.j). Preferably one or more of the scheduling schemes of the arbiter and
decoder module (510) is adapted to consider the state of those control signals.
The interconnect arbiter and decoder module (510) generates one or more control signals as
output on ports (530.a, 530.b, 530.c, 530.d, 530.e) that are supplied to the 5 interconnect
master’s ingress ports (540.mi, 541.mi, 542.mi, 543.mi, ) respectively. The interconnect
arbiter and decoder module (510) also generates one or more control signals as outputs (not
illustrated) which are supplied over ports (520.f, 520.g, 520.h, 520.i, 520.j) to the ingress ports
(560.i, 561.i, 562.i, 563.i, 564.i) of the interconnect targets (560, 561, 562, 563, 564)
respectively.
Preferably the arbiter and decoder module (510) of the timeslot based interconnect (501)
employs at least one scheduling scheme selected from the group comprising:
a least recently granted interconnect master scheme (see figure 8);
a least recently granted interconnect master scheme with rate throttling on at least one
interconnect master (see figure 8);
a static timeslot scheme (see figure 5);
a dynamic timeslot scheme; and
a time triggered protocol scheme (see figure 7).
Preferably the shared memory computing architecture (500) is adapted such that:
the arbiter and decoder module (510) of the first timeslot based interconnect (501) is
adapted to:
grant a first timeslot to one of the M interconnect masters (540, 541, 542, 543,
544);
not grant the next timeslot to that interconnect master; and
grant one of the later timeslots to that interconnect master;
the first interconnect master is adapted to:
issue a memory transfer request to a first interconnect target during the first
timeslot; and
the first interconnect target is adapted to:
transmit at least part of its response to the first interconnect master during the
later timeslot granted to the first interconnect master.
Preferably at least one interconnect target (560, 561, 562, 563, 564) can receive two or more
outstanding memory transfer requests before releasing a memory transfer response related to the
first memory transfer request. Preferably at least one interconnect master (540, 541, 542, 543,
544) can issue two or more outstanding memory transfer requests to that interconnect target
before receiving the memory transfer response corresponding to the first memory transfer
request to that interconnect target. For example, a processor core (540) may concurrently issue a
memory transfer request to retrieve executable code and a memory transfer request to access
data.
Preferably the duration of at least one timeslot of the first timeslot based interconnect (501) is 1
clock cycle in length. For example, a first timeslot is 1 clock cycle in length, and the second
timeslot is 1 clock cycle in length. In an alternate preferred embodiment, each timeslot of the
first timeslot based interconnect has a variable duration of length that is upper-bound for that
timeslot. For example, the duration of the first timeslot is 1 clock cycle and the duration of
the second timeslot ranges from 1 to 2 clock cycles in length.
For the remainder of the text describing figure 3, each timeslot of the interconnect (501) has a
duration of 1 clock cycle in length, the FIFO (515) releases the value of each input as output 3
clock cycles later, the FIFO (516) releases the value of each input as output 3 clock cycles later,
and the on-chip memory store (560) releases its output after 3 clock cycles. The interconnect
target peripherals (561) to (564) take a variable amount of time to generate memory transfer
responses to the memory transfer requests they receive.
Figure 4 is a flow-chart illustrating (600) the steps of two different memory transfer requests
issued from 2 interconnect masters in the same clock-cycle to two different interconnect targets
of figure 3. The value of L is 3, and the interconnect arbiter and decoder module (510) is employing
a static round-robin timeslot schedule in which each timeslot has a fixed duration of 1 clock
cycle in length according to a preferred embodiment of the present invention. In this
pedagogical example, the interconnect masters (540) to (544) are adapted to issue memory
transfer requests in the same clock cycle they receive notification of being granted the current
timeslot. Furthermore, the interconnect arbiter and decoder module (510) is assumed to already
be started and operating.
In clock cycle 1 (601):
In step 631, the interconnect arbiter and decoder module (510) grants the current timeslot
of the timeslot based interconnect (501) to interconnect master (543). Interconnect
master (543) does not issue a memory transfer request.
In step 610, start the memory transfer request process for interconnect master (540).
In step 611, interconnect master (540) requests a timeslot on the timeslot based
interconnect (501).
In step 620, start the memory transfer request process for interconnect master (541).
In step 621, interconnect master (541) requests a timeslot on the timeslot based
interconnect (501).
In clock cycle 2 (602):
In step 632, the interconnect arbiter and decoder module (510) grants the current timeslot
of the interconnect (501) to interconnect master (544). That interconnect master does not
issue a memory transfer request.
In clock cycle 3 (603):
In step 633, the interconnect arbiter and decoder module (510) signals to interconnect
master (540) that it has been granted the current timeslot on the interconnect (501). The
interconnect arbiter and decoder module sets the value of the select input of the
multiplexer (521) to select interconnect master (540). That value is also forwarded to the
delay module (515) and is delayed for 3 clock cycles before being forwarded to the select
input of demultiplexer (531).
In step 612, the interconnect master (540) issues a memory transfer request addressed to
peripheral (562) along with all associated data to the timeslot based interconnect (501) in
one clock cycle.
In step 633, the interconnect arbiter and decoder module (510) decodes the address of
that memory transfer request, identifies that the memory address corresponds to the address
range of the peripheral (562) and sets the value of the select input on the demultiplexer
(522) to select peripheral (562). That value is also forwarded to the delay module (516)
and is delayed for 3 clock cycles before being forwarded to the select input of multiplexer
(532).
In clock cycle 4 (604):
In step 634, the interconnect arbiter and decoder module (510) signals to interconnect
master (541) that it has been granted the current timeslot on the interconnect (501). The
interconnect arbiter and decoder module (510) sets the value of the select input of the
multiplexer (521) to select interconnect master (541). That value is also forwarded to the
delay module (515) and is delayed for 3 clock cycles before being forwarded to the select
input of demultiplexer (531).
In step 622, the interconnect master (541) issues a memory transfer request addressed to
peripheral (563) along with all associated data in one clock cycle to the timeslot based
interconnect (501).
In step 634, the interconnect arbiter and decoder module (510) decodes the address of
that memory transfer request, identifies that the memory address corresponds to the address
range of the peripheral (563) and sets the value of the select input on the demultiplexer
(522) to select peripheral (563).
In clock cycle 5 (605):
In step 635, the interconnect arbiter and decoder module (510) grants the current timeslot
of the interconnect to interconnect master (542). Interconnect master (542) does not
issue a memory transfer request.
In clock cycle 6 (606):
The peripheral (562) generates its memory transfer response to the interconnect transfer
request issued in step 612.
In step 636, the interconnect arbiter and decoder module (510) grants the current timeslot
of the interconnect to interconnect master (543). Interconnect master (543) does not
issue a memory transfer request. The index to the multiplexer (532) selects peripheral
(562), and the demultiplexer (531) selects interconnect master (540), forwarding the
entire memory transfer response from the peripheral (562) to interconnect master (540) in
one clock cycle.
In step 613 the interconnect master (540) receives the response.
In clock cycle 7 (607):
The peripheral (563) generates its response to the interconnect transfer request issued in
step 622.
In step 637, the interconnect arbiter and decoder module (510) grants the current timeslot
of the interconnect (501) to interconnect master (544). Interconnect master (544) does
not issue a memory transfer request. The index to the multiplexer (532) selects
peripheral (563), and the demultiplexer (531) selects interconnect master (541),
forwarding the entire memory transfer response from the peripheral (563) to interconnect
master (541) in one clock cycle.
In step 623, the interconnect master (541) receives the response.
End of the memory transfer request process for interconnect master (540).
In clock cycle 8 (608):
In step 638, the interconnect arbiter and decoder module (510) grants the current timeslot
of the interconnect to interconnect master (540). Interconnect master (540) does not
issue a memory transfer request.
End of the memory transfer request process for interconnect master (541).
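By way of illustration only, and not forming part of the described circuitry, the following Python sketch models the fixed L = 3 clock cycle delay imposed by the delay modules (515) and (516) on the forwarded select values, and reproduces the timing of the two transfers above (request in clock cycle 3, response in clock cycle 6; request in clock cycle 4, response in clock cycle 7). The identifier names are assumptions introduced for this sketch.

    from collections import deque

    L = 3  # delay, in clock cycles, imposed by the delay modules (515) and (516)

    # A select value driven onto multiplexer (521) / demultiplexer (522) in cycle t
    # re-emerges on demultiplexer (531) / multiplexer (532) in cycle t + L.
    delay_line = deque([None] * L)
    schedule = {3: ("master 540", "peripheral 562"),   # step 612
                4: ("master 541", "peripheral 563")}   # step 622
    responses = {}

    for cycle in range(1, 9):
        aged = delay_line.popleft()          # select value that has aged L cycles
        if aged is not None:
            responses[cycle] = aged          # response forwarded in this cycle
        delay_line.append(schedule.get(cycle))

    assert responses == {6: ("master 540", "peripheral 562"),    # step 613
                         7: ("master 541", "peripheral 563")}    # step 623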
In a preferred embodiment of the present invention, a snarfing cache module (544) snoops every
memory transfer response released as output by the de-multiplexer (531) over wire (534).
Preferably each memory transfer response incorporates a copy of its corresponding memory
transfer request.
In a preferred embodiment of the present invention, each of the 5 interconnect master ports of the
interconnect (501) are connected to a different memory management unit (MMU) (not
illustrated) respectively. In this way, the 5 MMUs provide a means to enforce an access control
policy between interconnect masters and the interconnect targets from within the interconnect
(501).
It is further preferred that the means to enforce an access control policy is adapted to ensure that
no more than one interconnect master (540 to 544) can issue memory transfer requests to a given
interconnect target (560 to 564). In this way the access control policy guarantees that a memory
transfer request to that interconnect target (560 to 564) will not be delayed by any other
interconnect master (540 to 544).
In some cases, for the purpose of increasing the clock-speed of the circuitry, it may be desirable
to increase the pipeline depth of the interconnect (501) by adding registers (523) and (533).
In a preferred embodiment of the present invention, each of the M interconnect masters (540,
541, 542, 543, 544) are interconnect bridges.
Figure 5 is a timing diagram illustrating 3 rows of timing events (200) for memory transfer
requests (220), their completion times (230) and their response times (240) on a timeslot based
interconnect for transporting memory transfer requests generated by a shared memory computing
architecture of the type illustrated in figure 3 according to a preferred embodiment of the present
invention.
Timeline 210 illustrates 13 timeslots, the duration of each timeslot being 1 clock cycle in length.
Row 220 illustrates the consecutive mapping of 7 interconnect masters (not illustrated) labelled
(A) to (G) to 13 timeslots in a statically scheduled round-robin scheme with a period of 7 clock
cycles (201). In this illustration each interconnect master continually issues back-to-back
blocking read memory transfer requests. By blocking, it is meant that each interconnect master
waits for the response of any of its outstanding memory transfer requests before issuing its next
memory transfer request. In this illustration, each interconnect master is issuing a memory
transfer request to a different interconnect target (not illustrated).
Specifically, row (220) illustrates the timing of memory transfer requests issued on a
unidirectional timeslot based interconnect with 7 timeslots as follows: the first memory transfer
request is issued by interconnect master (A) at timeslot (220.1); the first memory transfer request
is issued by interconnect master (B) at timeslot (220.2); the first memory transfer request is
issued by interconnect master (C) at timeslot (220.3); the first memory transfer request is issued
by interconnect master (D) at timeslot (220.4); the first memory transfer request is issued by
interconnect master (E) at timeslot (220.5); the first memory transfer request is issued by
interconnect master (F) at timeslot (220.6); the first memory transfer request is issued by
interconnect master (G) at timeslot (220.7); the second memory transfer request is issued by
interconnect master (A) at timeslot (220.8); no memory transfer request is issued by interconnect
master (B) at timeslot (220.9); the second memory transfer request is issued by interconnect
master (C) at timeslot (220.10); the second memory transfer request is issued by interconnect
master (D) at timeslot (220.11); the second memory transfer request is issued by interconnect
master (E) at timeslot (220.12); and the second memory transfer request is issued by interconnect
master (F) at timeslot (220.13).
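As an editorial illustration only (the names below are assumptions introduced for this sketch), the static round-robin assignment of row 220 can be modelled in Python as a simple modulo calculation over the 7 interconnect masters:

    masters = ["A", "B", "C", "D", "E", "F", "G"]

    def slot_owner(n):
        # n is the 1-based index of request timeslot (220.n); the schedule repeats
        # with a period of 7 timeslots (201).
        return masters[(n - 1) % len(masters)]

    assert slot_owner(1) == "A"     # timeslot (220.1)
    assert slot_owner(8) == "A"     # timeslot (220.8), one full period later
    assert slot_owner(13) == "F"    # timeslot (220.13)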
Row 230 illustrates the time at which each memory transfer request completes: no memory
transfer requests are completed on timeslots (230.1), (230.2), (230.3) and (230.5); the memory
transfer request (220.1) completes at timeslot (230.4); the memory transfer request (220.2)
completes at timeslot (230.8); the memory transfer request (220.3) completes at timeslot (230.6);
the memory transfer request (220.4) completes at timeslot (230.7); the memory transfer request
(220.5) completes at timeslot (230.8); the memory transfer request (220.6) completes at timeslot
(230.9); the memory transfer request (220.7) completes at timeslot (230.10); the memory transfer
request (220.8) completes at timeslot (230.11); the memory transfer request (220.9) completes at
timeslot (230.12); and the memory transfer request (220.10) completes at timeslot (230.13).
Row 240 illustrates the timing of memory transfer responses on a second unidirectional timeslot
based interconnect with 7 timeslots: the memory transfer request (220.1) receives its completion
response at timeslot (240.4); the memory transfer request (220.2) receives a completion pending
response at timeslot (240.5); the memory transfer request (220.2) receives its completion
response at timeslot (240.11); the memory transfer request (220.3) receives its completion
response at timeslot (240.6); the memory transfer request (220.4) receives its completion
response at timeslot (240.7); the memory transfer request (220.5) receives its completion
response at timeslot (240.8); the memory transfer request (220.6) receives its completion
response at timeslot (240.9); the memory transfer request (220.7) receives its completion
response at timeslot (240.10); the memory transfer request (220.8) receives its completion
response at timeslot (240.11); there is no memory transfer request issued at (220.9); the memory
transfer request (220.10) receives its completion response at timeslot (240.13).
In this illustration (200), the interconnect targets of interconnect masters (A) and (C) to (G)
are guaranteed to complete their memory transfer requests within 3 timeslots (254),
whereas the interconnect target of interconnect master (B) is guaranteed to complete its memory
transfer request within 6 timeslots (253).
Figure 5 illustrates that the alignment of the memory transfer request timeslots (220) and the
memory transfer response timeslots ({220.1, 240.4}, {220.2, 240.5}, {220.3, 240.6}, …) are
phase shifted by 3 clock cycles to the right (241). In this case, 9 out of 10 memory transfer
responses (240.4, 240.6, 240.7, 240.8, 240.9, 240.10, 240.11, 240.12, 240.13) were not delayed
(254) longer than necessary (258), resulting in significantly improved performance when
compared to not phase shifting the time between the request timeslot and response timeslots.
Only one (230.B1) of the 13 memory transfer responses (230) was delayed. In this case, it was
delayed by 4 clock cycles (257). Advantageously, the idle timeslot (240.5) and the delay of the
memory transfer response (230.8) had no impact on the timing of memory transfer
requests/responses of any other interconnect masters. Ideally the phase shifting is selected to
optimise for the round-trip time for the majority of memory transfer requests at the cost of a
relatively small increase in latency for the minority.
In this way we have described the timing behaviour of a shared memory computing architecture
that comprises:
M interconnect masters (A, B, C, D, E, F, G), where the value of M is 7;
7 interconnect targets;
a first timeslot based interconnect for transporting memory transfer requests and their
corresponding responses, comprising:
a unidirectional timeslot based interconnect to transport memory transfer requests
(220) with T timeslots, where the value of T is 7 (201);
a unidirectional timeslot based interconnect to transport memory transfer
responses (240) with R timeslots, in which:
for each of the R timeslots, that timeslot:
corresponds to one memory transfer request timeslot ({240.4,
220.1}, {240.5, 220.2}, …); and
starts at least L clock cycles (241) after the start time of that
corresponding memory request timeslot ({220.1, 240.4} through to
{220.10, 240.13}), where the value of L is at least 3 and less than
the value of T;
all 7 interconnect targets are connected to the first timeslot based interconnect;
for each interconnect master I of the M interconnect masters (A, B, C, D, E, F, G):
each interconnect master I is connected to the first timeslot based interconnect;
in which each of the T timeslots (220.1, 220.2, 220.3, 220.4, 220.5, 220.6, 220.7)
is mappable to a different one of the M interconnect masters (A, B, C, D, E, F, G).
Furthermore, figure 5 illustrates that the value of R (which is 7) equals the value of T (which is
7), and each of the T memory transfer request timeslots (220.1, 220.2, 220.3, 220.4, 220.5,
220.6, 220.7) on the first timeslot based interconnect has a corresponding memory transfer
response timeslot (240.4, 240.5, 240.6, 240.7, 240.8, 240.9, 240.10) of the same length (1 clock
cycle) on that interconnect.
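For illustration only, the fixed correspondence between request timeslots (220) and response timeslots (240) described above can be sketched as follows (a minimal Python model, assuming the phase shift equals L clock cycles; the identifier names are assumptions introduced for this sketch):

    T = 7   # request timeslots per period (201)
    L = 3   # phase shift between a request timeslot and its response timeslot (241)

    def response_slot(request_slot):
        # Request timeslot 220.n corresponds to response timeslot 240.(n + L),
        # e.g. {220.1, 240.4}, {220.2, 240.5}, ..., {220.7, 240.10}.
        return request_slot + L

    assert [response_slot(n) for n in range(1, 8)] == [4, 5, 6, 7, 8, 9, 10]
    assert 3 <= L < T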
Figure 6 is a block schematic diagram illustrating portions of a shared memory computing
architecture (700), employing embodiments of figure 3 according to a preferred embodiment of
the present invention. The shared memory computing architecture (700) comprises:
a first system onnect (720) of the type described in figure 3;
an on-chip random access memory store (761) comprising at least one interconnect target
port ({761.i1, 761.e1}, {761.i2, 761.e2}), in which the first interconnect target port
{761.i1, 761.e1} is connected to the first system (720) interconnect;
at least two sub-computing devices (730, 740), in which:
the first (730) of the at least two sub-computing devices (730, 740) comprises:
a first local interconnect (710) comprising:
a unidirectional interconnect (711) for transporting memory
transfer requests; and
a unidirectional interconnect (712) for transporting the
corresponding memory transfer responses;
a first interconnect master (731) connected to a local interconnect (710) of
the sub-computing device;
a unidirectional interconnect bridge {733.a, 733.b} comprising two ports,
in which:
the first port is connected to the first system interconnect (720);
the second port is connected to a local interconnect (710) of the
sub-computing device; and
in which the first interconnect master (731) is adapted to issue memory
transfer requests to the on-chip random access memory store (761) over
the unidirectional interconnect bridge {733.a, 733.b}; and
the second (740) of the at least two sub-computing devices (730, 740) comprises:
a first local interconnect (715) comprising:
a unidirectional interconnect (716) for transporting memory
transfer requests; and
a unidirectional interconnect (717) for transporting the
corresponding memory transfer responses;
a first interconnect master (741) connected to a local interconnect (715) of
the sub-computing device; and
a unidirectional interconnect bridge {743.a, 743.b} comprising two ports,
in which:
the first port is connected to the first system interconnect (720);
and
the second port is connected to a local interconnect of the sub-computing
device (715); and
in which the first interconnect master (741) is adapted to issue memory
transfer requests to the on-chip random access memory store (761) over
the unidirectional interconnect bridge {743.a, 743.b}; and
a first peripheral (751), comprising:
a first interconnect target port (751.t1) which is connected to the first local
interconnect (710) of the first (730) of the at least two sub-computing devices
(730, 740); and
a first interconnect master port (751.m1) which is adapted to issue memory
transfer requests to the on-chip random access memory store (761);
in which:
the first interconnect master (731) of the first (730) of the at least two sub-computing
devices (730, 740) is adapted to issue memory transfer requests to the
first peripheral (751).
The first peripheral (751) of the shared memory computing architecture (700) further comprises:
a second interconnect target port (751.t2) which is connected to the first local
interconnect (715) of a second (740) of the at least two sub-computing devices (730,
740); and
the first interconnect master (741) of the second (740) of at least two sub-computing
devices (730, 740) is adapted to issue memory transfer requests to the first peripheral
(751).
The shared memory computing architecture (700) further comprises:
a second peripheral (752), comprising a first interconnect target port (752.t1) which is
connected to the first system interconnect (720);
in which the first interconnect master (731, 741) of at least two (730, 740) of the at least
two sub-computing devices (730, 740) is adapted to issue memory transfer requests to the
second peripheral (752).
The first peripheral (751) of the shared memory computing architecture (700) further comprises
a first interconnect master port (751.m1) which is adapted to issue memory transfer requests to the
on-chip random access memory (761) over the interconnect (720).
The multiprocessor interrupt controller (771) with software maskable interrupt lines is adapted to
map one or more interrupt lines between each peripheral (751, 752) and one or more
interconnect masters (731, 741). The multiprocessor interrupt controller has a dedicated
interconnect target port (772, 773) for each of the at least two sub-computing devices (730, 740).
Preferably, the private memory store (732) is connected as an interconnect target to the local
interconnect (710) of the sub-computing device (730).
Preferably, each port of the dual-port time-analysable memory controller and off-chip memory
store (762) is connected as an interconnect target to the timeslot based interconnect (720).
Preferably, the timer module (742) has an interconnect target port which is connected to the
interconnect (715) of the sub-computing device (740), and can generate an interrupt which is
exclusively received (not illustrated) by interconnect master (741).
In figure 6 the interconnect master (731) can issue memory transfer requests over interconnect
(710) to the interconnect target (732) and to the interconnect target port of the interconnect bridge
{733.a, 733.b} that leads to the timeslot based interconnect (720). This capability permits scaling of
the number of interconnect target devices accessible by the interconnect master (731) in a
statically time-analysable manner without increasing the number of time-slots on one or more
timeslot based interconnects (720). This also permits frequent, latency sensitive, memory
transfer requests from (731) to be serviced by an interconnect target device (732), without
incurring the multi-interconnect-master arbitration latency penalties that are present on the timeslot
based interconnect (720).
Preferably the first system interconnect (720) is a timeslot based interconnect. A desirable
property of connecting the interconnect master peripherals (751, 752) directly to the timeslot
based interconnect (720) is that it becomes trivially easy to calculate the upper-bound latency of
their memory transfer requests and the peak bandwidth that can be sustained to the on-chip
memory (761).
Preferably, the shared memory computing device (700) of figure 6 comprises a means, such as
the use of memory management units (not illustrated), to enforce an access control policy that
limits which interconnect masters ({733.a, 733.b}, {743.a, 743.b}, 751, 752) can issue memory
transfer requests to which interconnect targets ({752.t1}, 761).
In an alternate preferred embodiment, the shared memory computing architecture (700) further
comprises a second system interconnect (799) in which:
the on-chip random access memory store (761) has at least two interconnect target ports
({761.i1, 761.e1}, {761.i2, 761.e2});
the second interconnect target port {761.i2, 761.e2} of the random access memory store
(761) is connected to the second system interconnect (799);
the first interconnect master port of the first peripheral is disconnected from the first
system interconnect (720) and connected to the second system interconnect (799); and
the first interconnect master port of the second peripheral is disconnected from the first
system interconnect (720) and connected (not illustrated) to the second system
interconnect (799).
Figure 7 is a block diagram illustrating a static timeslot schedule (810) with a cycle of 24 fixed
timeslots (801 to 824) of 1 clock cycle each that rotate cyclically left (850) by 1 entry every
clock cycle for preferred embodiments of the present invention. The 4 interconnect masters (1,
2, 3, 4) are scheduled once every second timeslot (801, 803, 805, …), such that each interconnect
master is scheduled once every eight timeslots. For example, interconnect master 1 is scheduled in
timeslots (801, 809, 817). The value (illustrated as corresponding to interconnect master 1) in
element (825) is used by the arbiter and decoder module to control which interconnect master is
granted access to a given timeslot based interconnect. The 12 interconnect master peripherals (5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) are scheduled once every second timeslot, such that each of
those 12 interconnect master peripherals is scheduled once every 24 timeslots. In this way, the 4
interconnect masters (1, 2, 3, 4) are granted higher-frequency access, and thus proportionally
more bandwidth, than the other 12 interconnect master peripherals. This particular scheduling
scheme is well suited to managing 4 processor cores along with 12 interconnect master
peripherals on one timeslot based interconnect, such as interconnect (720) of figure 6. Clearly
each interconnect master peripheral (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) must be able to
buffer data to write without loss for up to 24 clock cycles.
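By way of illustration only, the following minimal Python sketch generates the 24-entry rotating schedule of figure 7, assuming (as described above) that the odd-numbered entries carry the 4 interconnect masters in rotation and the even-numbered entries carry the 12 interconnect master peripherals; the names are assumptions introduced for this sketch.

    schedule = []
    for i in range(1, 25):                             # entries (801) to (824)
        if i % 2 == 1:
            schedule.append(1 + ((i - 1) // 2) % 4)    # masters 1..4, every 2nd slot
        else:
            schedule.append(5 + (i // 2 - 1) % 12)     # peripherals 5..16

    # Interconnect master 1 appears in entries 1, 9 and 17, i.e. timeslots (801),
    # (809) and (817), once every eight timeslots.
    assert [i + 1 for i, owner in enumerate(schedule) if owner == 1] == [1, 9, 17]
    # Each of the 12 interconnect master peripherals appears once per 24-slot cycle,
    # and each of the 4 interconnect masters appears three times.
    assert all(schedule.count(p) == 1 for p in range(5, 17))
    assert all(schedule.count(m) == 3 for m in range(1, 5))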
Figure 8 is a block diagram that illustrates a least recently granted (LRG) interconnect master
scheme with 16 time-slots of 1 clock cycle each according to a preferred embodiment of the
present invention. Region (860) illustrates the value of the 16 timeslots (861 to 876) in the first
clock cycle and region (880) illustrates the value of the same 16 timeslots (881 to 896) in the
second clock cycle. The LRG scheme ensures that if all 16 interconnect masters are
concurrently issuing memory transfer requests at an equal rate, then each interconnect master is
granted an equal number of timeslots on the interconnect. On the other hand, if less than 16
interconnect masters are concurrently issuing memory transfer requests, then the available
bandwidth is opportunistically allocated to the active interconnect masters. In figure 8, at the
start of the first clock-cycle (860) interconnect masters 4 (864), 9 (869), 10 (870), and 12 (872)
have issued memory transfer requests and are waiting to be granted on the timeslot based
interconnect. In this case, the least recently granted interconnect master was interconnect master
12 (872), and that interconnect master is granted access to the current timeslot on the timeslot
based interconnect. At the start of the next clock cycle (880), interconnect master 12 (881) is
placed at the start of the queue, interconnect masters 1 to 11 (861, …, 871) age by one clock
cycle (882, …, 892), and interconnect master 6 (886) issues a memory transfer request to the
timeslot based interconnect. In this clock cycle (880), the least recently granted interconnect
master with a pending memory transfer request is 10 (891), and it is granted access to the current
timeslot of the timeslot based interconnect.
In a further preferred embodiment of the present invention, a rate limiting counter is associated
with each of the 16 interconnect masters, for example counter (897) for interconnect master 12
(881). The rate limiting counter decreases by one each clock cycle, stopping at zero. When the
timeslot based interconnect is reset, each interconnect master is assigned a value indicating how
many clock cycles must pass before that interconnect master may be granted the timeslot based
interconnect by the arbiter after having completed a memory transfer request. This rate-limiting
capability can be used to reduce power consumption (by reducing the number of reads and/or
writes to the shared memory store) and to ensure higher-bandwidth or higher-frequency devices
have greater opportunity to be granted the timeslot based interconnect.
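The following Python sketch is an editorial illustration of the least recently granted scheme with per-master rate limiting counters described above; it is a behavioural model under the stated assumptions, not the circuit of figure 8, and the identifier names are assumptions introduced for this sketch.

    def lrg_grant(queue, pending, cooldown):
        # Scan from the least recently granted end (back of the queue) for a master
        # with a pending request whose rate-limit counter has reached zero.
        for master in reversed(queue):
            if master in pending and cooldown.get(master, 0) == 0:
                queue.remove(master)
                queue.insert(0, master)     # the granted master moves to the front
                return master
        return None

    queue = list(range(1, 17))              # master 1 most recent ... 16 least recent
    cooldown = {m: 0 for m in queue}        # all rate-limit counters idle

    # First clock cycle (region 860): masters 4, 9, 10 and 12 are pending; 12 wins.
    assert lrg_grant(queue, {4, 9, 10, 12}, cooldown) == 12
    # Second clock cycle (region 880): master 6 also issues a request; 10 now wins.
    assert lrg_grant(queue, {4, 9, 10, 6}, cooldown) == 10

    # On completing a transfer, a master's counter would be reloaded with its
    # configured value; every counter then decreases by one per cycle, stopping at zero.
    cooldown = {m: max(0, c - 1) for m, c in cooldown.items()}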
Figure 9 is a table illustrating an access control list (ACL) (900) for 8 interconnect masters (902)
and 16 interconnect targets (901) connected to a timeslot based interconnect, for preferred
embodiments of the present invention. The label ‘X’ illustrates that a specific interconnect
master may access a specific interconnect target, where its absence indicates prohibition of
access.
Figure 9 illustrates an access control list policy (900) in which that ACL policy has been
configured in such a way that no more than one interconnect master can issue memory transfer
requests to a given interconnect target on that timeslot based interconnect. For example the first
interconnect target (910) may only be accessed by the third interconnect master (922) as
illustrated by the label ‘X’ (in column 1, row 3) and its absence in every other column of row 3.
An interconnect master may be permitted by the ACL policy to access more than one
interconnect target. For example the first interconnect master (920) is permitted to issue memory
transfer requests to the second (911) and fourth (913) interconnect targets. Furthermore, an
interconnect master may not be permitted to issue memory transfer requests to any interconnect
target peripherals on that interconnect, as illustrated by row five of the table for the fifth
interconnect master (924).
Figure 9 can be encoded as a 1 dimensional array of 64-bits in length, partitioned into 16
elements (one for each interconnect target), each element being 4-bits in length and indicating
which one of the up to 16 interconnect masters may access it.
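For illustration only, a minimal Python sketch of this packed encoding, assuming masters are numbered 0 to 15 and exactly one master is named per interconnect target (the function names are assumptions introduced for this sketch):

    def encode_acl(master_for_each_target):
        # Pack 16 elements of 4 bits each into one 64-bit value; element t names the
        # single interconnect master permitted to access interconnect target t.
        acl = 0
        for target, master in enumerate(master_for_each_target):
            acl |= (master & 0xF) << (4 * target)
        return acl

    def permitted_master(acl, target):
        return (acl >> (4 * target)) & 0xF

    owners = [2] + [0] * 15                  # e.g. target 0 assigned to master 2
    acl = encode_acl(owners)
    assert permitted_master(acl, 0) == 2
    assert acl.bit_length() <= 64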
Preferably, the ACL policy is adapted to be dynamically adjusted at run-time by supervisor
software, such as a hypervisor or operating system, in response to the set of currently active
tasks. Preferably there are two levels of ACL policy: a first ACL policy specifying which set of
interconnect masters are permitted to be mapped to any given interconnect target, and a second
ACL policy that selects which (if any) one of those interconnect masters is currently assigned to
any given interconnect target. This then permits a system-level supervisory software to set
system level ACL constraints, while permitting each sub-computing device to independently
select a valid ACL configuration from the permissible sub-set of all possible configurations for
that sub-computing device.
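The two-level policy can be sketched as follows (an editorial Python illustration only; the names and data shown are assumptions, not part of the described architecture):

    # First level: for each interconnect target, the set of interconnect masters the
    # system-level supervisory software permits to be mapped to it.
    allowed = {"target_0": {2, 5}, "target_1": {0}}

    # Second level: the single master (if any) currently assigned to each target.
    assigned = {}

    def assign(target, master):
        # A sub-computing device may only select an assignment that satisfies the
        # system-level ACL constraints.
        if master not in allowed.get(target, set()):
            raise PermissionError(f"master {master} is not permitted for {target}")
        assigned[target] = master

    assign("target_0", 2)            # accepted: master 2 is in the permitted set
    try:
        assign("target_1", 3)        # rejected by the first-level policy
    except PermissionError:
        pass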
Figure 10 is a hybrid block schematic diagram illustrating the allocation/partitioning (1100) of
memory ((761) of figure 6), and the timing of interconnect masters, specifically software tasks
(1120) running on processor cores and peripherals (1130), accessing that memory ((761) of
figure 6) according to embodiments of the type of figure 3 and figure 6. In this illustration the
width of the timeslot based interconnect ((720) of figure 6) is 1024 bits in length, and each of the
elements of memory in memory store (761) is also 1024 bits in length.
Logical partition 1101 illustrates two elements of memory store (761) allocated to store the
content of a network packet for a peripheral that performs operations on 2048-bit long packets.
Logical partition (1102) shows 6 elements of memory store (761) allocated for use by memory
transfer requests issued by at least one interconnect master port of that peripheral. Logical
partitions (1103) and (1104) are allocated 2 elements of memory store (761) which are used as
end-of-queue buffers, so that while one packet is being written into one of the two logical
partitions, the other packet in the other logical partition is being transferred to an independent
(possibly off-chip) memory. This permits the head-of-queue packets to be stored in SRAM store
(761) while still having buffers allocated for receiving and offloading packets as they arrive
from that peripheral to an independent memory.
Logical partition (1105) illustrates 12 elements of memory assigned to 12 time-slots of a time-triggered
protocol with variable length timeslots of up to 1024-bits in length.
Logical partitions (1107, 1108, 1109, 1110, 1111) are assigned to a single network peripheral
that has 5 virtual ports. Each of those 5 logical partitions may be assigned exclusively to a
different processor core and/or operating system instance and/or communications session. In
preferred embodiments of the present invention the number of virtual queues, and the length of
each virtual queue assigned to a peripheral is dynamically set at boot up, and those preferences
are communicated to the peripheral over its interconnect target port, or a partition in (1100)
storing configuration data.
Logical partition (1112) is left unallocated.
Logical partition (1113) is allocated for sending and receiving messages between two RTOS
instances running on a first processor core and a second processor core. Preferably, the two
RTOS instances are configured to further sub-partition that space.
Timeline 1119 illustrates four ({1121, 1123}, {1123, 1125}, {1125, 1127}, {1127, 1129}) time
and space (T&S) partitions for software tasks (1122, 1124, 1126) illustrated in region (1120). A
first task (1122) operates in the first T&S partition {1121, 1123} on processor core (731), a
second task (1124) operates in a second T&S partition on processor core (731), a third task
(1126) operates in a third T&S partition on processor core (731). With regard to peripheral
activity (1130), a peripheral (752 of figure 6) receives a packet transmitted to it over a public
wide-area network, and writes that packet into a partition of memory (1100). Due to unknown latencies
introduced at run time by competing traffic over the public wide-area network, it is not possible
to accurately predict at what time that packet will arrive. That packet is processed by the task
(1126) in the third T&S partition, and a new packet of data is generated by that task (1126) and
written into partition (1105). The interconnect master port of that peripheral (752) accesses the
partition (1105) to retrieve that new packet so that it can be transmitted over the wide area
network. The tasks (1122, 1124, 1126) all access memory (1100) during their allocated timeslots.
Advantageously, when the timeslot based interconnect (720) is running a fixed time-slot
scheduling scheme, the reception (1131) and transmission (1132) of packets results in no
unwanted/uncontrolled timing interference for the memory transfer requests issued by processor
core (731) to (732). As there is no uncontrolled timing interference, static worst case execution
time analysis of tasks running on core (731) can be achieved with tighter bounds than with the
conventional multi-core architectures in which multiple processor cores and interconnect master
peripherals are permitted work-preserving access to SDRAM. When the timeslot based
interconnect is running in a least recently granted interconnect master mode without rate limiters,
the timing interference is upper bounded to the equivalent of a static timeslot scheduling scheme
with one timeslot per interconnect master.
Advantageously, the 1024-bit wide SRAM (720) offers exceptionally high bandwidth when
compared to a 64-bit wide double-data-rate off-chip SDRAM channel operating at comparable
clock-speeds. It is possible to use the relatively high aggregate bandwidth of the SRAM (720) to
ensure that every peripheral has sufficient bandwidth to operate at its (off-chip I/O) wire-speed,
even in a static timeslot scheduled environment servicing multiple interconnect masters. This
approach tends to significantly increase the total effective usable memory bandwidth within a
computing device. For example, in many cases, a packet sent or received by a peripheral may
not ever have to be written to the relatively low-bandwidth off-chip memory store.
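As a purely illustrative calculation (the clock frequency below is an assumption chosen only for this sketch), the raw bandwidth ratio implied by the widths discussed above can be approximated as:

    f_hz = 400_000_000                  # assumed common clock frequency
    sram_bits_per_cycle = 1024          # one 1024-bit wide on-chip SRAM port
    ddr_bits_per_cycle = 2 * 64         # 64-bit wide double-data-rate channel

    print(sram_bits_per_cycle * f_hz / 1e9, "Gbit/s on-chip")    # 409.6
    print(ddr_bits_per_cycle * f_hz / 1e9, "Gbit/s off-chip")    # 51.2
    # An 8x raw ratio, before SDRAM row activation, refresh and turnaround overheads
    # reduce the usable off-chip figure further.
    assert sram_bits_per_cycle // ddr_bits_per_cycle == 8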
Figure 11 is a block schematic diagram illustrating portions of a shared memory computing
architecture (1300) optimised for bounded worst case execution time, employing
embodiments of figures 1 and 3 according to a preferred embodiment of the present invention.
The shared memory computing architecture (1300) comprises:
a first system interconnect (1350) of the type described in figure 1;
an on-chip random access memory store (1370) comprising two interconnect target ports,
in which the first interconnect target port is connected to the first system (1350)
interconnect;
at least two sub-computing devices (1330, 1340), in which:
the first (1330) of the at least two sub-computing devices (1330, 1340) comprises:
a first local interconnect (1310) comprising:
a unidirectional interconnect (1311) for transporting memory
transfer requests; and
a unidirectional interconnect (1312) for transporting the
corresponding memory transfer responses;
a first interconnect master (1331) connected to a local interconnect (1310)
of the sub-computing device;
a unidirectional interconnect bridge {1351.a, 1352.a} comprising two
ports, in which:
the first port is connected to the first system interconnect (1350);
the second port is connected to a local interconnect (1310) of the
sub-computing device; and
in which the first interconnect master (1331) is adapted to issue memory
transfer requests to the on-chip random access memory store (1370) over
the unidirectional interconnect bridge {1351.a, 1352.a};
the second (1340) of the at least two sub-computing devices (1330, 1340)
comprises:
a first local interconnect (1315) comprising:
a unidirectional interconnect (1316) for transporting memory
transfer requests; and
a unidirectional interconnect (1317) for transporting the
corresponding memory transfer responses;
a first interconnect master (1341) connected to a local interconnect (1315)
of the sub-computing device; and
a unidirectional interconnect bridge {1351.b, 1352.b} comprising two
ports, in which:
the first port is connected to the first system interconnect (1350);
the second port is connected to a local interconnect of the sub-computing
device (1315); and
in which the first interconnect master (1341) is adapted to issue memory
transfer requests to the on-chip random access memory store (1370) over
the unidirectional interconnect bridge {1351.b, 1352.b}.
The shared memory computing architecture (1300) further comprises:
an on-chip random access memory store (1370) comprising at least two interconnect
target ports, in which:
the first port:
has a data path of D-bits in width, the value of D being equal to 128;
is adapted to sustain a throughput of one D-bit wide memory transfer
request per clock cycle; and
is adapted to sustain a throughput of one D-bit wide memory transfer
response per clock cycle; and
the second port:
has a data path of E-bits in width, the value of E being equal to 16;
is adapted to sustain a throughput of one E-bit wide memory transfer
request per clock cycle; and
is adapted to sustain a throughput of one E-bit wide memory transfer
response per clock cycle;
a first on-chip shared memory interconnect (1350) of the type described in figure 1
which:
has a data path of D-bits in width;
is exclusively connected to the first port of the at least two interconnect target
ports of the on-chip random access memory (1370);
is adapted to sustain a throughput of one D-bit wide memory transfer request per
clock cycle to the on-chip random access memory (1370);
is adapted to sustain a throughput of one D-bit wide memory transfer response per
clock cycle; and
has at least two cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}) connected to
it, each cache module comprising:
a master port with a D-bit wide data path which is connected to this
interconnect (1350); and
a target port;
and a second on-chip shared memory interconnect (1360) of the type described in figure
1 which:
has a data path of E-bits in width;
is exclusively connected to the second port of the at least two interconnect target
ports of the on-chip random access memory (1370);
is adapted to sustain a peak throughput of one E-bit wide memory transfer request
per clock cycle to the on-chip random access memory (1370); and
is adapted to sustain a peak throughput of one E-bit wide memory transfer
response per clock cycle; and
has at least two interconnect masters (1381, 1382) connected to it.
Preferably the dual-port on-chip random access store (1370) is internally composed of 8 dual-port
16-bit wide on-chip random access stores arranged in parallel. The first port is adapted to
receive memory transfer requests with data lengths ranging from 16 to 128-bits in length, in
multiples of 16 bits. The second port is adapted to receive 16 bit memory transfer requests. This
configuration is well suited to cost effectively creating a memory store that can sustain the wire-speed
bandwidth requirements of a relatively large number of lower bandwidth peripherals while
permitting interconnect masters (1331) and (1341) relatively high bandwidth, low latency access
to that data.
In an alternate preferred embodiment of the present invention, the value of D is equal to 256 and
the value of E is equal to 256 and the dual-port on-chip random access store (1370) is internally
comprised of 16 dual-port 32-bit wide on-chip random access stores arranged in parallel. This
configuration is well suited to supporting the wire speed of higher bandwidth peripherals.
Preferably both the first (1350) and second (1360) on-chip shared memory interconnects employ
timeslot based arbitration schemes; and at least two timeslots of the first on-chip shared memory
interconnect each have a timeslot length of one clock cycle in length.
It is further preferred that both interconnects (1350) and (1360) only employ timeslots that have
a duration of 1 clock cycle in length, and in which the data-path width is adapted so that it is
sufficiently wide to transmit an entire memory transfer request and/or its corresponding memory
transfer response in 1 clock cycle. This latter configuration is particularly desirable when
compared against a configuration in which both interconnects employ timeslots of 2 clock
cycles, a configuration which would double the worst case access latency for an interconnect
master directly connected to the interconnect seeking to gain access to a timeslot. To place this
result in context, several commercial off the shelf average case execution time optimised multi-
core computer architectures employ bus protocols, such as AMBA AHB 2, which permit
memory transfer requests to block the bus for well over 10 clock cycles.
This latter configuration, in which each timeslot is 1 clock cycle in length, is extremely desirable
even if one or more of the interconnect masters cannot sustain high rates of memory transfer
requests. This is because this configuration achieves the lowest worst case access latencies at the
point of contention between interconnect masters.
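To make the latency comparison concrete, the following minimal Python sketch assumes the simplified model in which a master that has just missed its timeslot waits for one full rotation of the schedule; it is an editorial illustration only, not a claimed timing bound.

    def worst_case_wait(num_timeslots, slot_len_cycles):
        # A master that just missed its timeslot waits for one full rotation of the
        # schedule before it is granted again (simplified model).
        return num_timeslots * slot_len_cycles

    # Doubling the timeslot length from 1 to 2 clock cycles doubles the worst case
    # access latency at the point of contention (here for 16 timeslots).
    assert worst_case_wait(16, 1) == 16
    assert worst_case_wait(16, 2) == 2 * worst_case_wait(16, 1)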
The computing architecture (1300) r comprises:
at least one processor core (1331, 1341);
a peripheral (1383), comprising:
a first interconnect target port (1381.t1) which is connected by wires (1384, 1385)
to the first on-chip shared memory interconnect (1350); and
a first interconnect master port (1381.m1) which is connected to the second on-chip
shared memory interconnect (1360);
in which:
at least one (1331, 1341) of the at least one processor cores (1331, 1341) can issue
a memory transfer request over the first on-chip shared memory interconnect
(1350) to the peripheral (1383);
the peripheral (1383) can store data in the on-chip random access memory over
the second system interconnect (1360); and
the at least one (1331, 1341) of the at least one processor cores (1331, 1341) can
read that data.
The computing architecture (1300) further comprises:
a first peripheral interconnect (1355) of the type described in figure 3 for transporting
memory transfer requests and their corresponding responses;
a peripheral (1381), comprising:
a first interconnect target port (1381.t1) which is connected to the first peripheral
interconnect (1355);
a second interconnect target port (1381.t2) which is connected to the first
peripheral interconnect (1355); and
a first interconnect master port (1381.m1) which is connected to one (1360) of the
at least two on-chip shared memory interconnects (1350, 1360);
in which:
at least one of the at least one processor cores (1331, 1341) can issue a memory
transfer request over the first peripheral interconnect (1355) to the peripheral
(1381);
the peripheral (1381) can store data in the on-chip random access memory (1370)
over the second system interconnect (1360); and
the at least one of the at least one processor cores (1331, 1341) can read that data.
Preferably the peripheral interconnect is adapted to transport each memory transfer request in 1
clock cycle and each corresponding memory transfer response in 1 clock cycle. Preferably the
data-path width of the peripheral interconnect (1355) is less than the data-path width of the
second interconnect (1350, 1360).
Preferably there is a second peripheral interconnect (not illustrated) adapted to enable the
processor cores (1331, 1341) to communicate with peripherals that do not have an interconnect
master interface. The use of a second peripheral interconnect for peripherals that do not have
interconnect master interfaces is particularly advantageous because it permits many relatively
low bandwidth peripherals to be placed and routed on the chip some distance away from the
memory store (1370) which is used by relatively high bandwidth interconnect-master
peripherals.
The computing architecture (1300) further comprises:
a peripheral (1382), comprising:
a first interconnect target port (1382.t1) which is connected to the first peripheral
interconnect (1355);
a first interconnect master port (1382.m1) which is connected to one (1360) of the
at least two on-chip shared memory interconnects;
in which:
at least one of the at least one processor cores (1331, 1341) can issue a memory
transfer request over the first peripheral interconnect (1355) to the peripheral
(1382);
the peripheral (1382) can store data in the on-chip random access memory (1370)
over the second system interconnect (1360); and
the at least one of the at least one processor cores (1331, 1341) can read that data.
Preferably the two interconnect bridges ({1351.a, 1352.a}, {1351.b, 1352.b}) are cache modules.
The use of cache modules is highly desirable as it permits interconnect masters with relatively
narrow data path widths, such as 32-bit processor cores (1331, 1341), to take better advantage of
interconnects (1350) and shared on-chip memories (1370) with relatively wide data paths (e.g.
128-bit). For example, if there are sixteen 32-bit processor cores, in which each core has a
private cache module that is attached to the same interconnect (1350), increasing the data-path
width of that interconnect (1350) from 128-bit to 512-bit or higher increases the amount of data
prefetched by read memory transfer requests issued by each cache module to that interconnect
(1350). This in turn tends to result in improved masking of the worst case 16 clock cycle access
latencies between 2 consecutive memory transfer requests issued by a cache module to that
shared memory (1370) over that interconnect (1350) for that caches’ processor core.
Preferably at least 2 of the at least 2 cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}) which
are connected to the first on-chip shared memory interconnect (1350) maintain cache-coherency
with each other ({1351.a, 1352.a}, {1351.b, 1352.b}) with zero timing interference to unrelated
memory transfer requests received on the target port of those at least 2 cache modules ({1351.a,
1352.a}, {1351.b, 1352.b}). These properties simplify the worst case execution time analysis of
tasks running on cores (1331, 1341) that access their private cache modules ({1351.a, 1352.a},
{1351.b, 1352.b}).
Preferably at least 2 of the at least 2 cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}) which
are connected to the first on-chip shared memory interconnect (1350) operate in a cache-coherency
group that maintains cache-coherency between each other and also maintains cache
coherency against the write memory transfer requests (1399) issued to at least one of the other
ports of the on-chip random access memory (1370). For example in a 16 core system (1331,
1341, …) with 64 interconnect-master peripherals (1381, 1382, 1383, …), a cache-coherency
group could include 2 out of 16 processor cores, and 10 out of 64 interconnect-master
peripherals. This reduces the upper-bound rate of cache coherency traffic that must be processed
by the cache modules for those 2 cores, resulting in significant power savings and lower-cost
look-up mechanisms in the cache modules. For example, this cache coherency group would only
need to sustain looking up to 12 memory transfer requests every 16 clock cycles instead of
looking up to 32 memory transfer requests every 16 clock cycles.
Preferably at least 2 of the at least 2 cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}) which
are connected to the first on-chip shared memory interconnect (1350) operate in a cache-coherency
group that maintains cache-coherency between each other and are update type caches
that snarf each other's write requests. This is particularly advantageous when performing worst
case execution time (WCET) analysis of tightly coupled tasks in shared memory architectures. Let
us consider the situation in which the first core (1341) requests a resource lock and the second
core (1331) releases that same resource lock. The cache snarfing mechanisms can be adapted to
guarantee that all write requests issued by the core (1341) before that core (1341) released the
resource lock are processed by the snarfing cache of core (1331) before that core (1331) is
granted that shared ce lock. This ensures that each cache-line that was t in the cache
of core (1331) before that core (1331) requested a shared memory resource lock are coherent
with the write memory transfer requests issued by core (1341). This then avoids the need to
consider which cache-lines, if any, were updated by other tasks running on other cores in the
cache coherency group that are sharing a common region of memory. This can result in a very
significant reduction in upper-bound WCET analysis complexity. It can also result in tighter
upper-bound WCET analysis times for those tasks. By way of comparison, the use of an
eviction type of cache would result in some cache-lines that were present in the cache of core
(1331) before the resource lock was requested being evicted so as to maintain coherency with the
write memory transfer requests of core (1341). This would require the upper-bound WCET
analysis tools to identify which cache-lines could potentially have been evicted so as to make
pessimistic timing assumptions about access to those cache-lines.
The use of on-chip dual port memory (1370) is particularly well suited for supporting a relatively
low number of high-bandwidth bus masters such as processor cores (1331, 1341) connected to
the first onnect (1350), and a larger number of peripherals (for example, 64 peripherals)
operating at their wire speed which are connected to the second interconnect (1360). In
particular, increasing the number of peripherals, say from 64 to 128, does not reduce the
bandwidth, or increase the access latencies of processor cores (1331), (1341) to the shared
memory (1370). Furthermore, one or more timeslots of the second interconnect (1360) can be
allocated to high bandwidth peripherals (say 1 gigabit/s Ethernet peripherals) over lower
bandwidth peripherals (say 10 Megabit/s Ethernet peripherals) which need only be allocated one
timeslot to meet their wire speed bandwidth requirements.
In some situations, it will be desirable for one or more of the M interconnect bridges ({1351.a,
1352.a}, {1351.b, 1352.b}) to operate as an interconnect protocol transcoding bridge in which
the protocol to transcode is a bus interconnect protocol such as ARM AMBA AHB [2].
The time-analysable multiprocessor interrupt controller (1392) with software maskable interrupt
lines is adapted to map one or more interrupt lines between the peripherals (1381, 1382) and one
or more interconnect masters (1331, 1341).
The shared memory computing device (1300) further comprises:
N cache modules ({1351.a, 1352.a}, {1351.b, 1352.b}), where the value of N is 2, each
cache module comprising:
a master port;
a target port; and
a finite state machine that employs an update-type cache coherency policy;
N processor cores (1331, 1341), in which each core:
is assigned a different one of the N fully associative cache modules ({1351.a,
1352.a}, {1351.b, 1352.b}) as its private cache; and
in which:
the execution time of memory transfer requests issued by each of the N processor
cores (1331, 1341) is not modified by the:
unrelated memory transfer requests issued by any of the other N processor
cores (1331, 1341); and
unrelated memory transfer requests issued by at least one other
interconnect master (1381, 1382, 1383);
one {1351.a, 1352.a} of the N cache modules ({1351.a, 1352.a}, {1351.b,
1352.b}) can maintain cache coherency against a different one of the N cache
modules {1351.b, 1352.b}; and
that cache module {1351.a, 1352.a} can maintain cache coherency against
memory transfer requests issued by the at least one interconnect master (1381,
1382, 1383) by monitoring wire (1399).
Figures 12 to 14 illustrate alternative interconnect designs according to preferred embodiments
of the present invention. These alternative interconnect designs can be employed to implement
the interconnect (720) of figure 6 and the 3 interconnects (1350), (1360) and (1355) of figure 11.
Figure 12 is a block schematic diagram illustrating portions of a shared memory computing
architecture (1700) for preferred embodiments of the present invention. The shared memory
computing architecture (1700), comprises:
M interconnect nodes (1701, 1702, 1703, 1704), where the value of M is 4, each
interconnect node comprising:
an egress port; and
an ingress port;
a singular interconnect node (1705) comprising:
an egress port; and
an ingress port;
a first Mx1 interconnect (1706) for transporting memory transfer requests and their
corresponding responses, comprising:
M bidirectional ports ({1711.i, 1711.e}, {1712.i, 1712.e}, {1713.i, 1713.e},
{1714.i, 1714.e}), each comprising:
an ingress port (1711.i, 1712.i, 1713.i, 1714.i) which is connected to the
egress port of a different one of the M interconnect nodes (1701, 1702,
1703, 1704); and
an egress port (1711.e, 1712.e, 1713.e, 1714.e), which is connected to the
ingress port of a different one of the M interconnect nodes (1701, 1702,
1703, 1704);
a singular bidirectional port ({1715.i, 1715.e}) comprising:
an egress port (1715.e) which is connected to the ingress port of the
singular interconnect node (1705); and
an ingress port (1715.i) which is connected to the egress port of the
singular interconnect node (1705);
a parallel-in, serial-out (PISO) M input port x 1 output port shift register (1707)
with M stages (1751, 1752, 1753, 1754), in which:
for each stage I of the M stages: that stage is connected to the egress port
of the interconnect node I of M interconnect nodes ({1751, 1711.i, 1701},
{1752, 1712.i, 1702}, {1753, 1713.i, 1703}, {1754, 1714.i, 1704}); and
the output of stage 1 (1751) is connected to the egress port (1715.e) of the
singular port of the interconnect;
a serial-in, parallel-out (SIPO) 1 input port x M output port module (1708), in
which the input is connected to the ingress port of the singular port of the
interconnect (1715.i); and
an arbiter and decoder module (1716) which is adapted to control the PISO Mx1
shift register (1707) and the SIPO 1xM module (1708).
In this pedagogical description, the value of W is set as the number of bits to transport a memory
transfer request of the maximum length for that interconnect and its corresponding response in
one clock cycle. An idle memory transfer request is encoded as W bits with the binary value of
zero. The arbiter and decoder module (1716) controls: the select input of each of the 2 data
input, 1 data output multiplexers (1720, 1721, 1722, 1723, 1725, 1726, 1727, 1728), each
multiplexer having a data-path of W bits; the select input of the optional 2 data input, 1 data
output multiplexer (1729) which has a data-path of W bits; the enable input of each of the
registers (1730, 1731, 1732), each register having a data-path of W bits; the enable input of each
of the optional registers (1740, 1741, 1742, 1743, 1744), each register having a data-path of W
bits; the enable input of register (1746) which has a data-path of W bits; the enable input of each
of the optional registers (1745, 1747), each register having a data-path of W bits.
The interconnect arbiter and decoder module (1716) receives as inputs the control signals (not
illustrated) received on ports (1711.i, 1712.i, 1713.i, 1714.i, 1715.i). Preferably the arbiter and
decoder module (1716) implements at least one scheduling policy that considers the state of
those input control signals.
The interconnect arbiter and decoder module (1716) generates one or more control signals as
outputs (not illustrated) that are supplied as output on ports (1711.e, 1712.e, 1713.e, 1714.e,
1715.e). One or more of these control signals released as output on ports (1711.e, 1712.e,
1713.e, 1714.e, 1715.e) are used to inform each interconnect node (1701, 1702, 1703, 1704,
1705) if it has been granted a timeslot on the interconnect to issue a memory transfer request (if
it is an interconnect master); and to provide relevant meta-data associated with a memory transfer
request sent to that interconnect node (if it is an interconnect target).
The following text describes the use of the optional registers (1740, 1741, 1742) and the optional
registers (1745, 1747).
This paragraph describes the parallel-in, serial-out (PISO) M input port x 1 output port shift
register module (1707) in greater detail. The data-path of each of the ingress ports (1711.i,
1712.i, 1713.i, 1714.i) is gated by the multiplexers (1720, 1721, 1722, 1723) respectively. The
data path of each of the egress ports of (1711.e, 1712.e, 1713.e, 1714.e, 1714.s) is gated by the
multiplexers (1725, 1726, 1727, 1728, 1729) respectively. In the fourth stage (1754) of the
parallel-in, serial-out (PISO) M input port x 1 output port shift register (1707), the binary value 0
is supplied as input to the first data port of multiplexer (1737). The output of multiplexer (1723)
is supplied as input to the second data port of multiplexer (1737). The output of multiplexer
(1737) is supplied as data input to the register (1732). In the third stage (1753), the output of
register (1732) is supplied as input to the first data port of multiplexer (1736). The output of
multiplexer (1722) is supplied as input to the second data port of multiplexer (1736). The output
of multiplexer (1736) is supplied as data input to the register (1731). In the second stage (1752),
the output of register (1731) is supplied as input to the first data port of multiplexer (1735). The
output of multiplexer (1721) is supplied as input to the second data port of multiplexer (1735).
The output of multiplexer (1735) is supplied as data input to the register (1730). In the first stage
(1751), the output of register (1730) is supplied as input to the first data port of multiplexer
(1717). The output of multiplexer (1720) is supplied as input to the second data port of
multiplexer (1717). The output of multiplexer (1717) is released as the egress output of port
(1715.e).
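For illustration only, the parallel-in, serial-out behaviour described above can be abstracted as the following Python model; it uses one register per stage and therefore abstracts away the per-stage gating multiplexers and the combinational first stage (multiplexer (1717)) of figure 12, and the class and value names are assumptions introduced for this sketch.

    class PisoShiftRegister:
        """Abstract model of the 4-stage parallel-in, serial-out path (1707): stage 1
        drives the egress port (1715.e) and the remaining stages shift towards it by
        one position every clock cycle, with the idle value zero entering at the end."""

        def __init__(self, stages=4):
            self.stages = [0] * stages       # stage 1 is index 0

        def load(self, values):
            # Parallel load from the ingress ports (1711.i) to (1714.i).
            self.stages = list(values)

        def clock(self):
            out = self.stages[0]             # value leaving on the egress port
            self.stages = self.stages[1:] + [0]
            return out

    piso = PisoShiftRegister()
    piso.load(["req_A", "req_B", "req_C", "req_D"])
    assert [piso.clock() for _ in range(4)] == ["req_A", "req_B", "req_C", "req_D"]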
This paragraph describes the serial-in, parallel-out (SIPO) 1 input port x M output port module
(1708) in greater detail. The output of interconnect node (1705) is received on ingress port
(1715.i) and is supplied to the data input of registers (1740) and (1745). The output of the W-bit
wide register (1740) is gated by multiplexer (1725). The output of W-bit wide register (1745) is
supplied to the data input of registers (1741) and (1746). The output of the W-bit wide register
(1741) is gated by multiplexer (1726). The output of W-bit wide register (1746) is supplied to
the data input of registers (1742) and (1747). The output of the W-bit wide register (1742) is
gated by multiplexer (1727). The output of W-bit wide register (1747) is gated by
multiplexer (1728).
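Taken together, module (1708) is a delay chain (registers 1745, 1746, 1747) with per-port tap registers (1740, 1741, 1742) and egress gating. The sketch below is a hedged behavioural illustration of that connectivity only; the class and signal names are our own assumptions and all of the optional registers are treated as present.

```python
# Behavioural sketch (an assumption, not the patented RTL) of the SIPO 1xM module (1708):
# a delay chain of registers (1745, 1746, 1747), tap registers (1740, 1741, 1742) and the
# egress gating multiplexers (1725..1728) driven by the arbiter and decoder module (1716).
class Sipo4:
    def __init__(self):
        self.tap = [None, None, None]    # registers (1740), (1741), (1742)
        self.chain = [None, None, None]  # registers (1745), (1746), (1747)

    def clock(self, word_in, gate):
        """word_in: value on ingress port 1715.i; gate[i]: egress enable for ports 1711.e..1714.e."""
        egress = [
            self.tap[0] if gate[0] else None,    # port 1711.e via multiplexer (1725)
            self.tap[1] if gate[1] else None,    # port 1712.e via multiplexer (1726)
            self.tap[2] if gate[2] else None,    # port 1713.e via multiplexer (1727)
            self.chain[2] if gate[3] else None,  # port 1714.e via multiplexer (1728)
        ]
        self.tap = [word_in, self.chain[0], self.chain[1]]    # taps fed from 1715.i, 1745, 1746
        self.chain = [word_in, self.chain[0], self.chain[1]]  # 1745 <- 1715.i, 1746 <- 1745, 1747 <- 1746
        return egress
```

Each egress port therefore sees the serial input after one or more register delays; the FIFO-buffered variant described below trades that stagger for a parallel release.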
Preferably the arbiter and decoder module (1716) is adapted to employ the ingress and egress
gating to selectively block the outputs and inputs of interconnect nodes (1701, 1702, 1703, 1704)
respectively. Furthermore, the gating multiplexers can be used by the arbiter and decoder
module (1716) to enforce access controls. The gating multiplexers can be implemented using
AND gates without loss of generality.
In a preferred embodiment of the present invention, the interconnect node (1705) is an
interconnect master, and the interconnect nodes (1701, 1702, 1703, 1704) are interconnect
targets. In this embodiment, memory transfer requests are transported over the first serial-in,
parallel-out (SIPO) 1 input port x M output port module (1708) and memory transfer responses
are transported over the parallel-in, serial-out (PISO) M input port x 1 output port shift register
module (1707). Preferably each timeslot has a length of 1 clock cycle, the interconnect master
(1705) is adapted to issue a new memory transfer request every clock cycle and each
interconnect target (1701, 1702, 1703, 1704) is adapted to issue a memory transfer response once
every 4 clock cycles.
Preferably each interconnect target (1701, 1702, 1703, 1704) is assigned one timeslot, and the
interconnect master issues memory transfer requests in a round-robin fashion to each of the
interconnect targets (1701, 1702, 1703, 1704). In a preferred embodiment of the present
invention, the register (1740) is replaced with a 2 stage FIFO, the register (1741) is replaced with
a 1 stage FIFO, the optional registers (1742) and (1743) are both replaced with a 1 stage FIFO,
and the optional registers (1745) and (1747) are not used. In this case, the memory transfer
request for each timeslot (for 1701, 1702, 1703, 1704) is loaded into its corresponding FIFO
(1740, 1741, 1742, 1743). The concurrent output of each FIFO (1740, 1741, 1742, 1743) is
delayed by 1 clock cycle for each delay register (1745, 1746, 1747) that is employed. In this
illustration, only one delay register (1746) is employed, and so the output of each FIFO (1740,
1741, 1742, 1743) is released in parallel in the second timeslot. In this way a new memory
transfer request can be issued every clock cycle in a round robin scheme with 4 timeslots,
although it takes 5 clock cycles to transport each of those memory transfer requests to the 4
interconnect targets (1701, 1702, 1703, 1704).
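The timeslot accounting in this FIFO-buffered variant can be made concrete with a small model. The sketch below is an illustrative assumption (the per-target queue view and the names are ours): the master loads one request per clock cycle into the FIFO of the target owning that slot, and all four requests are released in parallel one delay stage after the last slot of the period.

```python
# Sketch (an assumed behavioural model, not the RTL) of the 4-timeslot round-robin scheme:
# one request is accepted per clock cycle, buffered per target (FIFOs 1740, 1741, 1742, 1743)
# and released to all four targets in parallel one delay stage (register 1746) later.
from collections import deque

PERIOD = 4        # four statically scheduled timeslots, one per target (1701..1704)
DELAY_STAGES = 1  # only delay register (1746) is employed in this illustration

def simulate(num_periods=3):
    buffers = [deque() for _ in range(PERIOD)]   # per-target request FIFOs
    deliveries = []
    for cycle in range(num_periods * PERIOD):
        target = cycle % PERIOD                  # round-robin: one new request per cycle
        buffers[target].append(("request", cycle))
        if target == PERIOD - 1:                 # last slot of the period has been loaded
            release = cycle + 1 + DELAY_STAGES   # parallel release after the delay stage
            deliveries.append((release, [buffers[t].popleft() for t in range(PERIOD)]))
    return deliveries
```

Under this model the request loaded in the first slot of a period is delivered 5 clock cycles after it was issued, matching the transport latency quoted above, while a new request is still accepted every clock cycle.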
In an alternate preferred embodiment of the present invention, the interconnect node (1705) is an
interconnect target, and the interconnect nodes (1701, 1702, 1703, 1704) are interconnect
masters. In this embodiment memory transfer requests are transported over the parallel-in,
serial-out (PISO) M input port x 1 output port shift register module (1707) and memory transfer
responses are transported over the first serial-in, parallel-out (SIPO) 1 input port x M output port
module (1708). Preferably each timeslot is 1 clock cycle in length, the interconnect masters
(1701, 1702, 1703, 1704) are adapted to issue a memory transfer request once every 4 clock
cycles and the interconnect target (1705) is adapted to receive a memory transfer request each
clock cycle and issue a memory transfer response each clock cycle.
Preferably module (1707) is adapted to transport just memory transfer requests and module
(1708) is adapted to transport memory transfer responses along with a copy of their
corresponding memory transfer requests to facilitate cache coherency for update-type snooping
caches (1705, 1715, 1744, 1729, , 1704).
Figure 13 is a flow-chart (1800) illustrating the steps of interconnect master (1702) issuing a
single memory transfer request over interconnect (1706) to interconnect target (1705) according
to a preferred embodiment of the present invention. The process described in flow chart (1800)
will not use the optional registers (1740, 1741, 1742, 1743, 1744, 1745, 1747), and the 4 memory
transfer responses within a statically scheduled round-robin period of 4 clock cycles will not be
buffered and released in parallel. In this way, only the PISO module (1707) is implementing a
timeslot based scheme, but the SIPO module (1708) employs a best-effort scheduling scheme.
In clock cycle 1 (1801):
In step 1820, the interconnect target (1705) receives the output of PISO module (1707)
which contains an idle memory transfer request. The interconnect target (1705) generates
an idle memory transfer response incorporating a copy of its corresponding idle memory
transfer request. The value of that memory transfer response is supplied to the interconnect (1708).
In step 1830, the value of the memory transfer response generated in step 1820 is
received as input on port 1715.i and supplied to the input of the SIPO module (1708) and
will be relayed across the 2 stages of that SIPO module. The first stage includes the
modules (1725), (1726) and (1746). The second stage includes the modules (1727) and
(1728). The interconnect arbiter and decoder module (1716) generates control signals on
ports (1711.e), (1712.e), (1713.e), and (1714.e) granting the next ingress timeslot of the
interconnect (1706) simultaneously to each of the interconnect masters (1701), (1702),
(1703) and (1704) respectively.
In step 1810, the value of the control signal generated by the arbiter and decoder module (1716) in step
1830 is received as input by the interconnect master (1702).
In clock cycle 2 (1802):
In step 1821, the interconnect target (1705) receives the output of PISO module (1707)
which contains an idle memory transfer request. The interconnect target (1705) generates
an idle memory transfer response incorporating a copy of its corresponding idle memory
transfer request which was received in step 1820. The value of that memory transfer
response is supplied to the interconnect (1708).
In step 1811, the interconnect master (1702) generates a memory transfer request
addressed to interconnect target (1705), the value of which is supplied to the interconnect
(1707).
In step 1831, the value of the memory transfer response generated in step 1821 is
received as input to the SIPO module (1708) and will be relayed across the 2 stages of the
SIPO module. The value of the memory transfer request generated in step 1811 is
received as input to the second stage (1752) of the PISO module (1707) and stored in
register (1730). Each of the other 3 interconnect nodes (1701), (1703), and (1704)
generates an idle memory transfer request which is received as input to the first stage
(1751), third stage (1753) and fourth stage (1754) respectively.
In clock cycle 3 (1803):
In step 1832, the value of the memory transfer request stored in register (1730) is
released as output of the PISO module (1707) and supplied as input to the interconnect
target (1705).
In step 1822, the interconnect target (1705) receives the output of PISO module (1707)
which contains the value of the memory transfer request generated as output by the
interconnect master (1702) in step 1811 and begins to process that request. The
interconnect target (1705) generates an idle memory transfer response incorporating a
copy of its corresponding idle memory transfer request which was received in step 1821.
The value of that memory transfer response is supplied to the interconnect (1708).
In step 1832, the value of the memory transfer response generated in step 1822 is
received as input to the SIPO module (1708) and will be relayed across the 2 stages of the
SIPO module.
In clock cycle 4 (1804):
In step 1823, the interconnect target (1705) receives the output of PISO module (1707)
which contains an idle memory transfer request. The interconnect target (1705) generates
a memory transfer response incorporating a copy of its corresponding memory
transfer request which was received in step 1822. The value of that memory transfer
response is supplied to the interconnect (1708).
In step 1833, the value of the memory transfer response generated in step 1823 is
received as input to the SIPO module (1708) and will be relayed across the 2 stages of the
SIPO module. The value of that memory transfer response received as input to the SIPO
module (1708) is directly released as output over port (1712.e) to interconnect master
(1702).
In step 1812, the interconnect master (1702) receives the value of the memory transfer
response sent in step 1833 corresponding to the interconnect master’s (1702) memory
transfer request issued in step 1811.
In this way we have illustrated an interconnect master (1702) issuing a memory transfer request
to interconnect target (1705) and receiving its corresponding memory transfer response over
interconnect (1706).
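Condensing the four clock cycles into a single structure may make the walkthrough easier to follow; the summary below merely paraphrases flow chart (1800) and adds no behaviour of its own.

```python
# One-entry-per-cycle summary (a paraphrase of flow-chart 1800, not normative) of a single
# request from interconnect master (1702) to interconnect target (1705) and its response.
TIMELINE_1800 = {
    1: "arbiter (1716) grants the next ingress timeslot; target (1705) sees an idle request",
    2: "master (1702) issues its request; PISO stage 2 latches it in register (1730)",
    3: "PISO (1707) releases the request on port 1715.e; target (1705) starts processing it",
    4: "target's response enters SIPO (1708) and is released on port 1712.e to master (1702)",
}
```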
Preferably, the shared memory computing architecture (1700) further comprises a second serial-in,
parallel-out (SIPO) 1 input port x M output port (only port (1714.s) is illustrated) module
(1709) for transporting cache coherency traffic, in which:
the input is connected to the ingress port (1715.i) of the singular port {1715.i, 1715.e} of
the interconnect (1706); and
the arbiter and decoder module (1716) controls the second SIPO 1xM module.
Preferably the first SIPO (1708) and second SIPO (1709) employ different routing schemes. Let
us consider an example where interconnect nodes (1701, 1702, 1703, 1704) are interconnect
masters. In this example, the arbiter and decoder module (1716) selectively routes the value of
each memory transfer response back to the interconnect master that issued the corresponding
memory transfer request on the first SIPO (1708). However, for the second SIPO (1709), the
arbiter and decoder module (1716) forwards the value of each and every memory transfer
response (and its corresponding memory transfer request data) to the snoop port (only 1704.s
illustrated) of all interconnect masters. See the description of figure 20 for an example encoding
a memory transfer response with its corresponding memory transfer request. In this way the
snooping of write memory transfer requests can be performed when monitoring just the
interconnect transporting memory transfer responses. Preferably cache coherency groups are
employed so that memory transfer responses (and their corresponding memory transfer request
data) are selectively forwarded according to the cache coherency group policies in force on that
interconnect (1706).
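The two routing behaviours can be summarised as point-to-point return of responses on the first SIPO and group-wide fan-out of response-plus-request snoop traffic on the second. The sketch below is a hedged illustration: the function and parameter names are ours, and the gate dictionaries stand in for the egress enable signals of the gating multiplexers.

```python
# Sketch (assumed names, not the patented control logic) of the two routing schemes
# applied by the arbiter and decoder module (1716).
def route_first_sipo(issuing_master, egress_gates):
    """First SIPO (1708): return the response only to the master that issued the request."""
    for master_id in egress_gates:
        egress_gates[master_id] = (master_id == issuing_master)

def route_second_sipo(coherency_group, snoop_gates):
    """Second SIPO (1709): fan the response (with its embedded request) out to every
    snoop port whose master belongs to the relevant cache coherency group."""
    for master_id in snoop_gates:
        snoop_gates[master_id] = (master_id in coherency_group)
```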
So in this way we have illustrated a bidirectional interconnect (1706) for transporting memory
transfer requests and their corresponding memory transfer responses, comprising:
a unidirectional interconnect to transport memory transfer requests (1707);
a unidirectional interconnect to transport memory transfer responses (1708, 1709) which
is adapted to transport memory transfer responses that include a copy of the
corresponding memory transfer request.
In an alternate preferred embodiment, the interconnect node (1705) is an interconnect bridge. In
some situations, it will be desirable for the interconnect bridge (1705) to operate as an
interconnect protocol transcoding bridge in which the protocol to transcode is a bus interconnect
protocol such as ARM AMBA AHB [2].
Figure 14 is a block schematic diagram illustrating portions of a shared memory computing
architecture (1900), employing embodiments of figures 3 and 12 for preferred embodiments of
the present invention. Shared memory computing architecture (1900) comprises:
16 interconnect masters (1901 to 1916);
1 interconnect target (1917);
a composite interconnect {1960, 1961, 1962, 1963, 1964} comprising:
four sub-interconnects (1960, 1961, 1962, 1963) of the type described in figure
12, each sub-interconnect having 4 interconnect master ports ({1921 to 1924},
{1925 to 1928}, {1929 to 1932}, {1933 to 1936}) and 1 output port (1941, 1942,
1943, 1944);
one sub-interconnect (1964) having 4 input ports (1951 to 1954) and 1
interconnect target port (1955);
in which:
the 4 interconnect masters (1901) to (1904) are connected to sub-interconnect
(1960) on ports (1921) to (1924) respectively;
the 4 interconnect masters (1905) to (1908) are connected to sub-interconnect
(1961) on ports (1925) to (1928) respectively;
the 4 interconnect masters (1909) to (1912) are connected to sub-interconnect
(1962) on ports (1929) to (1932) respectively;
the 4 interconnect masters (1913) to (1916) are connected to sub-interconnect
(1963) on ports (1933) to (1936) respectively;
the 4 output ports (1941, 1942, 1943, 1944) of the 4 sub-interconnects (1960,
1961, 1962, 1963) are connected to the 4 input ports (1951, 1952, 1953, 1954) of
the sub-interconnect (1964) respectively;
the interconnect target (1917) is connected to sub-interconnect (1964) on port
(1955);
Preferably, the composite interconnect {1960, 1961, 1962, 1963, 1964} employs a statically
scheduled timeslot scheme with 16 timeslots, one for each of the interconnect masters (1901 to
1916).
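One straightforward way to realise the 16-timeslot static schedule over the two-level structure is to let the global slot number select both the granted master and its position in the tree. The mapping below is our illustrative assumption; the patent does not prescribe this particular slot-to-port assignment.

```python
# Illustrative static schedule (an assumption, not prescribed by the text) for the composite
# interconnect of figure 14: 16 timeslots, one per interconnect master (1901..1916).
SUB_INTERCONNECTS = (1960, 1961, 1962, 1963)

def granted(slot):
    """Return (master, sub_interconnect, local_port) for global timeslot `slot`."""
    index = slot % 16
    master = 1901 + index
    sub_interconnect = SUB_INTERCONNECTS[index // 4]  # which 4:1 first-level stage
    local_port = index % 4                            # port within that stage
    return master, sub_interconnect, local_port
```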
In one preferred embodiment of the present invention, the arbiter and decoder modules of the
five sub-interconnects (1960, 1961, 1962, 1963, 1964) are trivially substituted with a single
arbiter and decoder module controlling the composite interconnect {1960, 1961, 1962, 1963,
1964}. In an alternate preferred embodiment of the present invention, the five arbiter and
decoder modules in sub-interconnects (1960, 1961, 1962, 1963, 1964) are adapted to co-ordinate
their activities to create a single logical finite state machine (not illustrated) controlling the
composite interconnect {1960, 1961, 1962, 1963, 1964}.
Figure 14 illustrates that different types of interconnects can be combined together to create a
composite interconnect without a loss of generality.
In an alternate embodiment of the present invention, the interconnect nodes (1901 to 1916) are
interconnect targets and the interconnect node (1917) is an interconnect bridge which permits
one or more interconnect masters (not illustrated) to issue memory transfer requests over that
interconnect bridge (1917) to the interconnect targets (1901 to 1916). Preferably the composite
interconnect {1960, 1961, 1962, 1963, 1964} further comprises a means to enforce an access
control policy between interconnect masters and interconnect targets. It is further preferred that
the means to enforce an access control policy is adapted to ensure that no more than one
interconnect master can issue memory transfer requests to a given interconnect target (1901 to
1916). In this way the access control policy guarantees that a memory transfer request to that
interconnect target will not be delayed by other interconnect masters.
Figure 15 is a high-level block schematic diagram illustrating a cache module (1200) for
preferred embodiments of the present invention. Cache module (1200) comprises:
an interconnect target port (1210);
an interconnect master port (1215);
two snoop ports (1212) and (1213);
a first in first out (FIFO) queue (1214) to store cache coherency traffic, being adapted to store snoop
traffic received on the two snoop ports (1212) and (1213);
a FIFO queue (1211) to store memory transfer requests received on the interconnect
target port (1210) being adapted to store:
at least one outstanding write memory transfer request; and
at least one outstanding read memory transfer request;
a dual-port cache-line store (1230) being adapted to store at least two cache-lines;
a FIFO queue (1235) being adapted to queue write memory transfer events;
a FIFO queue (1236) being adapted to queue read memory transfer events;
a queue (1237) being adapted to queue the order to process read and write memory
transfer events queued in the FIFO queues (1235) and (1236);
a FIFO queue (1238) called a write buffer (1238) being adapted to store the data of
cache-lines that have been evicted from the cache-line store (1230) and are to be written
over the interconnect master port (1215);
a dual port address tag finite state machine (1231) comprising:
a first target port;
a second target port;
a means to store tags that associate cache-lines stored in the cache-line store
(1230) with their respective (virtual and/or physical) addresses;
a means to search for tags by their (virtual and/or physical address); and
a means to search for tags by their index within the cache-line store (1230);
a triple port status tag finite state machine (1232) comprising:
a first target port;
a second target port;
a third target port;
a means to store tags that associate the cache-lines stored in the cache-line store
(1230) with their status and other related information, including:
which cache-lines are allocated;
which cache-lines are in the process of being evicted;
optionally which cache-lines are in the process of being cleaned;
which portions of the cache-lines are valid; and
which portions of the cache-lines are dirty; and
a means to process commands received on the first, second and third target ports
in a way that ensures internal consistency of the content of the tags and the
responses to the concurrently issued commands;
an interconnect (1239) that is work preserving comprising:
a high priority master port;
a low priority master port; and
a target port connected to the second port of the dual-port cache-line store (1230);
a front-side FSM (1220) comprising:
a master port connected to the low priority master port of the interconnect (1239);
a bidirectional communications channel with the FIFO queue (1211);
a bidirectional communications channel with the interconnect target port (1210);
a unidirectional communications channel with the queuing FSM (1221);
a bidirectional communications channel with the back-side FSM (1222);
a master port connected to the second target port of the dual port address tag
finite state machine (1231); and
a master port connected to the second target port of the triple port status tag finite
state machine (1232);
a queuing FSM (1221) comprising:
a bidirectional communications channel with the front-side FSM (1220);
a bidirectional communications channel with the back-side FSM (1222);
two master ports connected to the FIFO queue (1235) being adapted to queue
write memory transfer events;
two master ports connected to the FIFO queue (1236) being adapted to queue read
memory transfer events; and
two master ports connected to the FIFO queue (1237) being adapted to queue the
order to process read and write memory transfer events.
a back-side FSM (1222) comprising:
a master port connected to the high priority master port of the interconnect
(1239);
a bidirectional communications channel with the queuing FSM (1221);
a bidirectional communications channel with the front-side FSM (1220);
a master port connected to the third target port of the triple port status tag finite
state machine (1232);
two master ports connected to the write buffer (1238); and
a bidirectional communications channel with the interconnect master port (1215);
a snoop FSM (1223) comprising:
a bidirectional communications channel with the FIFO queue (1214);
a bidirectional communications channel with the back-side FSM (1222);
a master port connected to the first target port of the dual port address tag finite
state machine (1231);
a master port connected to the first target port of the triple port status tag finite
state machine (1232); and
a master port connected to the first port of the dual-port cache-line store (1230).
Figure 16 is a flow-chart (1400) illustrating the steps of the front-side FSM (1220) of figure 15
according to a preferred embodiment of the present invention. The process described in flow
chart (1400) is a functional description which executes over 1 or more clock cycles.
In step 1401, start the front-side FSM (1220) process.
In step 1402, perform a blocking read to fetch the next memory transfer request from the ingress
FIFO queue (1211). By blocking, it is meant that the read request will wait until a memory
transfer request is retrieved, even if the FIFO queue (1211) is initially empty when the read
request is issued.
In step 1403, issue a blocking command to the address tag finite state machine (1231) to search
for a cache-line by the address encoded in the memory transfer request received in step 1402. If
the cache-line is present, then issue a blocking command to the status tag finite state machine
(1232) to: (a) retrieve the status details including which portions of that cache-line are valid, (b)
request the status details of the least recently used cache-line, and (c) ask if there are any
currently unallocated cache-lines.
In step 1404, if the memory transfer request received in step 1402 is a read request go to step
1405 otherwise go to step 1415.
In step 1405, if the memory transfer request received in step 1402 corresponds to a cache-line
that is present in the cache-line store (1230) and the requested content is present in that cache-line
then go to step 1413 otherwise go to step 1406.
In step 1406, if the read memory transfer request received in step 1402 corresponds to a cache-line
that is present in the cache-line store (1230) but a requested portion of that cache-line is not
present/valid then go to step 1412 otherwise go to step 1407.
In step 1407, if there is at least one unallocated cache-line available in the cache-line store
(1230), then go to step 1411, otherwise go to step 1408.
In step 1408, issue a non-blocking command to the status tag finite state machine (1232)
marking the least recently used cache-line as being in the process of being evicted.
In step 1409, if the least recently used cache-line to be evicted is dirty and therefore must be
written out of the cache module (1200) then go to step 1410, otherwise go to step 1411.
In step 1410, issue a non-blocking command to the queuing FSM (1221) requesting an eviction
of the dirty cache-line. Wait for a notification from the back-side FSM (1222) indicating a write
transaction has completed.
In step 1411, issue a blocking command to the status tag finite state machine (1232) requesting
the allocation of an unallocated cache-line and receive the index for that newly allocated cache-line.
In step 1412, issue a non-blocking command to the queuing FSM (1221) requesting a read
memory transfer request, passing the index of the cache-line to store the retrieved data. Wait for
the back-side FSM (1222): (a) to indicate that the cache-line has been read and stored in the
cache-line store (1230), and (b) to forward a copy of the requested data to the front-side FSM.
In step 1413, issue a blocking command to the cache-line store (1230) to read a copy of the
requested data and forward a copy of the requested data to the front-side FSM.
In step 1414, issue a memory transfer response containing the requested read data to the
interconnect target port.
In step 1415, if the memory transfer request received in step 1402 corresponds to a cache-line
that is present in the cache-line store (1230) then go to step 1421 otherwise go to step 1416.
In step 1416, if there is at least one unallocated cache-line available in the cache-line store
(1230) then go to step 1420, otherwise go to step 1417.
In step 1417, issue a non-blocking command to the status tag finite state machine (1232)
marking the least recently used cache-line as being in the process of being evicted.
In step 1418, if the least recently used cache-line to be evicted is dirty and therefore must be
written out of the cache module (1200) then go to step 1419, otherwise go to step 1420.
In step 1419, issue a non-blocking command to the queuing FSM (1221) requesting an eviction of
the dirty cache-line. Wait for a notification from the back-side FSM (1222) indicating that a
write transaction has completed.
In step 1420, issue a blocking command to the status tag finite state machine (1232) requesting
the allocation of an unallocated cache-line and receive the index to that newly allocated cache-
line.
In step 1421, issue a non-blocking command to the cache-line store (1230) to write a copy of the
data received in the write memory transfer request to the location in the cache-line store (1230)
indicated by the index received in step 1420.
In step 1422, issue a non-blocking command to the status tag finite state machine (1232)
marking that cache-line as being dirty.
In step 1423, if this cache-line was previously clean, issue a non-blocking command to the
queuing FSM (1221) to inform it this cache-line is now dirty.
In step 1424, end the front-side FSM process.
In this way, we have demonstrated that the front-side FSM:
employs an allocate on read strategy;
employs an allocate on write strategy;
employs a least recently used eviction strategy; and
writes can be performed to any dirty cache-line which has been queued for eviction, but
not yet evicted (a minimal behavioural sketch follows below).
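The sketch models only the four points above with a single fully-associative store kept in least-recently-used order; it is an illustrative assumption (one flat data structure standing in for the tag machines, queues and FSMs of figure 15), not the patented implementation.

```python
# Minimal fully-associative, write-back, LRU cache model (a sketch of the listed
# strategies, not the patented FSMs): allocate on read and on write, evict the least
# recently used line, and write a dirty victim back before its slot is reused.
from collections import OrderedDict

class FullyAssociativeLruCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()               # tag -> [data, dirty], kept in LRU order

    def _allocate(self, tag, writeback):
        if len(self.lines) >= self.num_lines:    # no free line: evict the LRU entry
            victim_tag, (victim_data, dirty) = self.lines.popitem(last=False)
            if dirty:
                writeback(victim_tag, victim_data)
        self.lines[tag] = [None, False]

    def read(self, tag, fetch, writeback):
        if tag not in self.lines:                # allocate on read
            self._allocate(tag, writeback)
            self.lines[tag][0] = fetch(tag)
        self.lines.move_to_end(tag)              # mark as most recently used
        return self.lines[tag][0]

    def write(self, tag, data, writeback):
        if tag not in self.lines:                # allocate on write
            self._allocate(tag, writeback)
        self.lines[tag] = [data, True]           # the line becomes dirty
        self.lines.move_to_end(tag)
```

For example, a FullyAssociativeLruCache(4) with fetch and writeback callbacks bound to the interconnect master port would reproduce the hit, miss and eviction ordering described above, under the stated simplifications.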
Figure 17 is a flow-chart 1500 illustrating the steps of the queuing FSM (1221) of figure 15
according to a preferred embodiment of the present invention. The process described in flow
chart (1500) is a functional description which executes every clock cycle that the cache module
(1200) is enabled. At least one of the 4 policies is selected at power on, and the currently active
policy can be changed at run time.
In step 1501, start the queuing FSM (1221) process.
In step 1502, receive any commands issued by the front FSM (1220);
In step 1503, receive any notifications issued by the back FSM (1222);
In step 1504, if there are no commands issued by the front FSM (1220) this clock cycle then go
to step 1514, otherwise go to step 1505.
In step 1505, if a read command is received in step 1502, go to step 1506. If an eviction
command is received in step 1502, go to step 1507. Otherwise a dirty cache-line notification
command has been received in step 1502 therefore go to step 1508.
In step 1506, store the read command in FIFO queue (1236); go to step 1508.
In step 1507, store the write command in FIFO queue (1235); go to step 1508.
In step 1508, if the currently active policy is policy 1, go to step 1509. If the currently active
policy is policy 2, go to step 1510. If the currently active policy is policy 3, go to step 1511.
Otherwise the currently active policy is policy 4 therefore go to step 1512.
In step 1509, policy 1 employs a policy in which a cache-line is solely evicted in response to
receiving a memory transfer request which either:
flushes at least one specific cache-line; or
requires the allocation of at least one cache-line.
Policy 1 ignores all dirty cache-line notification commands received in step 1502. In a preferred
embodiment of the present invention, read and write operations will be queued in (1237) in the
order they are received. In an alternate preferred embodiment of the present invention, read
operations will take priority over queued write operations. Go to step 1513.
In step 1510, policy 2 employs a policy in which each cache-line is queued for eviction as soon
as it becomes dirty and a read-miss is serviced after all the currently outstanding dirty cache-lines
have been evicted.
If a dirty cache-line notification command was received in step 1502 then generate a write
command and store it in the FIFO queue (1235) to queue writing this dirty cache-line out of the
cache-module (1200). Go to step 1513.
In step 1511, policy 3 employs a policy in which each cache-line is queued for eviction as soon
as it becomes dirty and a read-miss is serviced before all the currently outstanding dirty cache-lines
have been evicted.
If a dirty cache-line notification command was received in step 1502 then generate a write
command and store it in the FIFO queue (1235) to queue writing this dirty cache-line out of the
cache-module (1200). Go to step 1513.
In step 1512, policy 4 employs a policy in which each cache-line is queued for eviction as soon
as it becomes dirty; and in which a read-miss is serviced before the eviction of the currently
outstanding dirty cache-lines queued for eviction on the condition that the execution time of each
of the outstanding dirty cache-line evictions is not modified as a result of executing the read-miss
operation first, otherwise the read-miss operation is delayed.
If a dirty cache-line notification command was received in step 1502 then generate a write
command and store it in the FIFO queue (1235) to queue writing this dirty cache-line out of the
cache-module (1200). Go to step 1513.
In step 1513, the content of the queue (1237) is updated according to the currently active policy.
In step 1514, if there are no transaction-completed notifications issued by the back FSM (1222)
this clock cycle then go to step 1519, otherwise go to step 1515.
In step 1515, if the back FSM (1222) issued a read transaction completed notification go to step
1516, otherwise a write transaction completed notification has been issued and therefore go to
step 1517.
In step 1516, remove one element from the FIFO queue (1236). Go to step 1518.
In step 1517, remove one element from the FIFO queue (1235). Go to step 1518.
In step 1518, remove one element from the queue (1237).
In step 1519, release a copy of the head-of-line values for queues (1236), (1235), (1237) as input
to the back FSM (1222).
In step 1520, end the queuing FSM (1221) process.
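Reduced to the decisions that differ between them, the four policies answer two questions: is a dirty line queued for eviction eagerly, and may a read-miss overtake the queued write-backs. The enumeration below is our own compact restatement with assumed names, not text taken from the flow chart.

```python
# Sketch (assumed encoding) of the four run-time selectable policies of the queuing FSM (1221),
# reduced to the two decisions in which they differ.
from enum import Enum

class EvictionPolicy(Enum):
    ON_DEMAND_ONLY = 1         # policy 1: evict only on a flush or on allocation pressure
    WRITEBACKS_FIRST = 2       # policy 2: eager eviction; read-miss waits for all write-backs
    READ_MISS_FIRST = 3        # policy 3: eager eviction; read-miss overtakes write-backs
    READ_MISS_IF_HARMLESS = 4  # policy 4: read-miss overtakes only if write-back timing is unchanged

def on_dirty_notification(policy, write_queue, cache_line):
    """Policies 2 to 4 queue the dirty line for eviction as soon as it is reported (steps 1510-1512)."""
    if policy is not EvictionPolicy.ON_DEMAND_ONLY:
        write_queue.append(cache_line)
```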
Figure 18 is a flow-chart (1600) illustrating the steps of the back-side FSM (1222) of figure 15
according to a preferred embodiment of the present invention. The process described in flow
chart (1600) is a functional description which executes over 1 or more clock cycles. This
process assumes the interconnect connected to the cache module's master interconnect port
(1215) issues memory transfer responses to write memory transfer requests indicating if the
transaction completed or needs to be resent because the transaction was corrupted before it could
be completed.
In step 1601, start the back-side FSM (1222) process.
In step 1602, receive any commands issued by the front FSM (1220);
In step 1603, receive a copy of the head-of-line values for queues (1236), (1235), (1237) and
store in variables R, W, and T respectively.
In step 1604, if there is no outstanding read memory transfer event R and no outstanding write
memory transfer event W, then go to step 1620, otherwise go to step 1605.
In step 1605, issue a blocking request to the interconnect master interface requesting a timeslot
on the interconnect (not illustrated). Preferably the interconnect (not illustrated) informs the
interconnect master port (1215) that it will be granted a timeslot on the interconnect at least one
clock cycle before its allotted timeslot starts. The rest of this process assumes this is the case.
In step 1606 if the value of T indicates the read operation should be serviced go to step 1608
otherwise the write operation should be serviced therefore go to step 1607.
In step 1607, issue a blocking command to the cache-line store (1230) to read a copy of the
requested data to write as per write memory transfer event W.
In step 1608, issue a non-blocking command to the status tag finite state machine (1232)
updating the status of the cache-line as clean. Go to step 1609.
In step 1609, wait 1 clock cycle for the start of the memory transfer request timeslot on the
interconnect (not illustrated).
In step 1610, if the value of T indicates the read operation should be serviced go to step 1611
otherwise the write operation should be serviced therefore go to step 1615.
In step 1611, create a read memory transfer request in response to the read memory transfer
event R and issue that memory transfer request over the interconnect master port (1215).
In step 1612, wait until the memory transfer response to the read memory transfer request issued
in step 1611 is received on interconnect master port (1215).
In step 1613, issue a non-blocking command to the cache-line store (1230) to write a copy of the
data received in step 1612 using the cache-line index stored in the read memory transfer event R.
In step 1614, issue a non-blocking command to the status tag finite state machine (1232)
updating the status of the portions of cache-line that are now valid. Go to step 1618.
In step 1615, create a write memory transfer request in response to the write memory transfer
event W and issue that memory transfer request over the interconnect master port (1215).
In step 1616, wait until the memory transfer response to the write memory transfer request issued
in step 1615 is received on interconnect master port 1215.
In step 1617, if the memory transfer response received in step 1616 requests that the write memory
transfer request be resent, go to step 1615 otherwise go to step 1618.
In step 1618, issue a transaction complete notification to the front FSM (1220) and a full copy of
the memory transfer response.
In step 1619, issue a transaction complete notification to the queuing FSM (1221).
In step 1620, end the back-side FSM (1222) process.
In an alternate preferred embodiment of the present invention, the notification to the front side
FSM (1220) and queuing FSM (1221) of the completion of a write memory transfer request
which is currently performed in steps 1618 and 1619 can instead be performed in step 1608.
This may permit the front side FSM (1220) to continue processing its current memory transfer
request earlier.
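Steps 1605 to 1617 above amount to a request-a-timeslot, issue, and retry-on-corruption loop for writes. The following sketch captures that loop; the master_port object and its method names are hypothetical stand-ins for the interconnect master port (1215) interface, which the text does not define at this level of detail.

```python
# Sketch (hypothetical interface names) of the back-side FSM's write path: obtain a
# timeslot, issue the write, and re-issue it while the response reports that the
# transaction must be resent, mirroring steps 1605-1617.
def service_write(master_port, write_event):
    master_port.request_timeslot()              # step 1605: blocking; granted in advance
    while True:
        master_port.issue_write(write_event)    # step 1615: write memory transfer request
        response = master_port.wait_response()  # step 1616: matching response
        if not response.must_resend:            # step 1617: retry only if corrupted
            return response                     # steps 1618-1619: notify front/queuing FSMs
```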
Figure 19 is a flow-chart 1000 illustrating the steps of the snoop FSM (1223) of figure 15
according to a preferred embodiment of the present invention. The process described in flow
chart (1000) is a functional description which executes over 1 or more clock cycles.
In step 1001, start the snoop FSM process.
In step 1002, perform a blocking read to fetch the next element of snoop traffic received on the
two snoop ports (1212, 1213) from the FIFO queue (1214). In this embodiment snoop traffic is
encoded as a copy of the memory transfer request and its corresponding memory transfer
response. Preferably all snoop traffic is transported and stored using forward error correcting
techniques. For example, the use of triple modular replication of all signals and registers, the use
of error correcting codes, or the use of double modular redundancy on communications paths
with time-shifted redundant transmission of messages with error checking codes.
In step 1003, if a read memory transfer request is received in step 1002, go to step 1008. If a
successful write memory transfer request has been received go to step 1004. Otherwise go to
step 1008. Preferably read memory transfer requests are not issued to the snoop ports (1212) and
(1213).
In step 1004, issue a blocking command to the address tag finite state machine (1231) to search
for the index of a cache-line by the address encoded in the memory transfer request received in
step 1002.
In step 1005, if the cache-line is not present in the cache-line store (1230) then go to step 1008.
In step 1006, issue a blocking command to the cache-line store (1230) to write a copy of the data
stored in the memory transfer request into the corresponding cache-line in the cache-line store
(1230). In this embodiment we have avoided adjusting the valid status flags to avoid
introducing a modification of the execution time for memory transfer requests issued on the
interference-target port (1210). This is the preferred mode of operation when the processor core
is not fully timing compositional and suffers from timing anomalies.
In an alternate preferred embodiment of the present invention, a non-blocking command is issued to
the status tag finite state machine (1232) to update which portions of the cache-lines are valid.
This may accelerate the execution time of memory transfer requests issued on the interference-
target port (1210) but may introduce additional complexity when performing worst case
execution time analysis of software running on the core associated with this cache.
In step 1007, end the snoop FSM (1223) process.
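The update-type behaviour of figure 19 can be summarised as: a successful write observed on the snoop ports updates the data of a line already held in the store, and nothing else changes. The sketch below uses hypothetical helper names (address_tags.lookup, line_store.write) for the operations the flow chart performs on modules (1231) and (1230).

```python
# Sketch (hypothetical helper names) of the update-type snoop handling of figure 19:
# a successful write seen on the snoop ports updates a line already present in the
# cache-line store, without touching the valid flags, so hit/miss timing on the
# target port (1210) is unchanged.
def handle_snoop(snoop_entry, address_tags, line_store):
    request, response = snoop_entry            # snoop traffic carries both (see figure 20)
    if request.kind != "write" or not response.ok:
        return                                 # reads and failed writes are ignored
    index = address_tags.lookup(request.address)   # step 1004
    if index is not None:                      # step 1005: line present
        line_store.write(index, request.data)  # step 1006: update the data in place
```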
The cache module of figure 15 is employed as:
the cache modules {733.a, 733.b}, {743.a, 743.b} of figure 6; and
the cache modules {1351.a, 1351.b}, {1352.a, 1352.b} of figure 11.
In this way we have now described how the shared memory computing device of figures 6 and 15
comprises:
N fully associative cache modules, where the value of N is at least 1, each fully
associative cache module comprising:
a master port;
a target port;
a means to track dirty cache-lines; and
a finite state machine with one or more policies, in which at least one policy:
employs an allocate on read strategy;
employs an allocate on write strategy; and
employs a least recently used eviction strategy; and
N processor cores, in which each core:
is assigned a different one of the N fully associative cache modules as its private
cache.
The combined use of fully-associative write-back cache modules with a least recently used
eviction scheme as thus described is particularly well suited for upper-bound WCET analysis.
In contrast, set-associative write-back caches with any type of eviction scheme (a mode of
operation found in a very large number of commercial computer architectures) are highly
undesirable for upper-bound WCET analysis due to the interaction between: unknown effective
addresses, the set-associative cache architecture, and the eviction of dirty cache-lines as a result
of unknown effective addresses.
With unknown effective addresses, for example that may occur as a result of a data-dependent
look up to an array that occupies more than one cache-line, it is not possible to statically
determine exactly which set of the set-associative cache is accessed. As a result, upper-bound
WCET analysis tools must make conservative assumptions about any one of the sets of the cache
that could have been accessed by that unknown effective address. In a 4-way set-associative
cache, this can lead to the pessimistic assumption by an upper-bound WCET analysis tool that a
full 25% of the cache-lines in the cache store may not be present. In both write-through and
write-back modes of operation, upper-bound WCET analysis tools work on the worst case
assumption that none of those potentially evicted cache-lines will now be present and that a read
memory transfer request to a cache-line that was present must be re-read. However in write back
mode of operation, upper-bound WCET analysis tools must also make pessimistic assumptions
about the write-back operations that may occur as a result of cache-lines that were dirty before
the unknown effective addresses lookup. Furthermore, if the cache-lines are backed by SDRAM
using an open-page mode of operation, those write-back operations may adjust which rows are
open in that SDRAM and thus the timing of operations to that SDRAM. Consequently this
combination of write back mode of operation with set-associative caches can result in quite
pessimistic upper-bound WCET results when compared to write-through mode of operation with
set-associative caches. The latter being the most popular mode of operation for performing
upper-bound WCET analysis today.
In contrast, a fully-associative cache with least recently used eviction scheme does not introduce
any ambiguity as to which cache-line would be evicted on an unknown effective address. Using
fully-associative caches with least recently used eviction schemes and write-back mode of
operation as described above will tend to result in better upper-bound WCET analysis results
when compared to set associative caches with write-through mode of operation, and fully-associative
caches with least recently used eviction schemes and write-through mode of
operation.
This technique can be used with some processor cores that do exhibit timing effects (such as the
Freescale MPC755), although it is preferred that those cores do not exhibit timing effects.
Figure 20 is a diagram illustrating the fields 2020 of a memory transfer request (2000) and the
fields of its corresponding memory transfer response (2010) which includes a copy of the
corresponding memory transfer request (2000) according to a preferred embodiment of the
present invention. In figure 20, the memory transfer request (2000) comprises:
an 8-bit field (2001) uniquely identifying an interconnect-master within the
computing architecture;
an 8-bit field (2002) indicating the transaction ID for that interconnect-master;
a 4-bit field (2003) indicating the transaction type, for example, a read or write memory
transfer request type;
a 5-bit field (2004) used to indicate the size of the memory transfer request in bytes;
a 32-bit field (2005) used to indicate the address of the memory transfer request in bytes;
a 256-bit field (2006) used to store the data to write for write memory transfer requests.
In figure 20, the memory transfer response (2010) comprises:
a copy of the memory transfer request, which comprises:
an 8-bit field (2001) uniquely identifying an interconnect-master within
the computing architecture;
an 8-bit field (2002) indicating the transaction ID for that interconnect-master;
a 4-bit field (2003) indicating the transaction type, for example, a read or write
memory transfer request type;
a 5-bit field (2004) used to indicate the size of the memory transfer request in
bytes;
a 32-bit field (2005) used to indicate the address of the memory transfer request in
bytes;
a 256-bit field (2011) used to store the data to write for write memory transfer
requests; and
a 4-bit response status field (2012).
The field (2011) is used to store the data read for read memory transfer requests. Figure 20
illustrates that the memory transfer response has all the essential meta-data used in the original
memory transfer request. In preferred embodiments, bus protocols do not use the transaction ID
field (2002) if they do not employ transaction IDs.
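For reference, the field widths of figure 20 can be written down as a simple layout table. The structure below only restates the widths listed above; the Python names are our own and no wire-level bit ordering is implied.

```python
# Field layout of figure 20 as (name, width-in-bits) pairs; a descriptive sketch of the
# encoding, not a normative wire format (names assumed, bit ordering unspecified).
MEMORY_TRANSFER_REQUEST = [
    ("master_id", 8),         # field 2001: uniquely identifies an interconnect-master
    ("transaction_id", 8),    # field 2002: transaction ID for that master
    ("transaction_type", 4),  # field 2003: e.g. read or write
    ("size_bytes", 5),        # field 2004: size of the transfer in bytes
    ("address", 32),          # field 2005: byte address of the transfer
    ("write_data", 256),      # field 2006: data to write, for write requests
]

MEMORY_TRANSFER_RESPONSE = MEMORY_TRANSFER_REQUEST[:-1] + [
    ("data", 256),            # field 2011: echoes write data, or carries read data
    ("response_status", 4),   # field 2012
]
```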
Various embodiments of the invention may be embodied in many different forms, including
computer program logic for use with a processor (e.g., a microprocessor, microcontroller, digital
signal processor, or general purpose computer), programmable logic for use with a
programmable logic device (e.g., a field programmable gate array (FPGA) or other PLD),
discrete components, integrated circuitry (e.g., an application specific integrated circuit (ASIC)),
or any other means including any combination thereof. In an exemplary embodiment of the
present invention, predominantly all of the communication between users and the server is
implemented as a set of computer program instructions that is converted into a computer
executable form, stored as such in a computer readable medium, and executed by a
microprocessor under the control of an operating system.
Computer program logic implementing all or part of the functionality where described herein
may be embodied in various forms, including a source code form, a computer executable form,
and various intermediate forms (e.g., forms generated by an assembler, compiler, linker, or
locater). Source code may include a series of computer program instructions implemented in any
of various programming languages (e.g., an object code, an assembly language, or a high-level
language such as ADA SPARK, Fortran, C, C++, JAVA, Ruby, or HTML) for use with various
operating systems or operating environments. The source code may define and use various data
structures and communication messages. The source code may be in a computer executable
form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator,
assembler, or compiler) into a computer executable form.
The computer program may be fixed in any form (e.g., source code form, computer executable
form, or an intermediate form) either permanently or transitorily in a tangible storage medium,
such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-
Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical
memory device (e.g., a CD-ROM or DVD-ROM), a PC card (e.g., PCMCIA card), or other
memory device. The computer program may be fixed in any form in a signal that is
transmittable to a computer using any of various communication technologies, including, but in
no way limited to, analog technologies, digital technologies, optical technologies, wireless
technologies (e.g., Bluetooth), networking technologies, and inter-networking technologies. The
computer program may be distributed in any form as a removable storage medium with
accompanying printed or electronic documentation (e.g., shrink wrapped software), preloaded
with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or
electronic bulletin board over the communication system (e.g., the internet or world wide web).
Hardware logic (including programmable logic for use with a programmable logic device)
implementing all or part of the functionality where described herein may be designed using
traditional manual methods, or may be designed, captured, simulated, or documented
electronically using various tools, such as computer aided design (CAD), a hardware description
language (e.g., VHDL or AHDL), or a PLD programming language (e.g., PALASM, ABEL, or
CUPL).
Programmable logic may be fixed either permanently or transitorily in a tangible storage
medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or
Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical
memory device (e.g., a CD-ROM or DVD-ROM), or other memory device. The programmable
logic may be fixed in a signal that is transmittable to a computer using any of various
communication technologies, including, but in no way limited to, analog technologies, digital
technologies, optical technologies, wireless technologies (e.g., Bluetooth), networking
technologies, and internetworking technologies. The programmable logic may be distributed as
a removable storage medium with accompanying printed or electronic documentation (e.g.,
shrink wrapped software), preloaded with a computer system (e.g., on system ROM or fixed
disk), or distributed from a server or electronic bulletin board over the communication system
(e.g., the internet or world wide web).
Throughout this specification, the words “comprise”, “comprised”, “comprising” and
“comprises” are to be taken to specify the presence of stated features, integers, steps or
components but do not preclude the presence or addition of one or more other features,
integers, steps, components or groups thereof.
REFERENCES
[1] G. Gebhard. Timing anomalies reloaded. In B. Lisper, editor, WCET, volume 15 of
OASICS, pages 1–10. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2010.
[2] ARM AMBA Specification (Rev 2.0), 1999. ARM IHI 0011A.
[3] Aeroflex Gaisler. NGMP Specification, Next Generation Multi-Purpose Microprocessor.
Report, European Space Agency, Feb 2010. Contract 22279/09/NL/JK.
http://microelectronics.esa.int/ngmp/NGMP-SPECi1r4.pdf
[4] F. J. Cazorla, R. Gioiosa, M. Fernandez, E. Quinones, M. Zulianello, and L. Fossati.
Multicore OS Benchmark (for NGMP). Final report, Barcelona Supercomputing Centre, 2012.
Under contract RFQ13153/10/NL/JK.
http://microelectronics.esa.int/ngmp/MulticoreOSBenchmark-FinalReport_v7.pdf
Claims (5)
1. A shared memory computing device comprising:
a shared memory;
at least one interconnect master, in which each interconnect master is adapted to issue memory transfer requests that can be received by the shared memory;
N cache modules, where the value of N is at least 1, each cache module comprising:
a master port;
a target port that is adapted to issue memory transfer requests that can be received by the shared memory; and
means to implement an update-type cache coherency policy;
M processor cores, where the value of M is equal to the value of N, in which each processor core:
is assigned a different one of the N cache modules as that processor core’s private cache;
and in which the memory access latency of non-atomic memory transfer requests issued by each of the M processor cores is not modified by: the memory transfer requests issued by any of the at least one interconnect masters.
2. A shared memory computing device as claimed in claim 1, in which the value of N is at least 2 and in which the memory access latency of non-atomic memory transfer requests issued by each of the M processor cores is not modified by the memory transfer requests issued by any of the other M processor cores.
3. A shared memory computing device as claimed in claim 2, in which at least one of the N cache modules is adapted to maintain coherency with regard to the data of the write memory transfer requests received on the target port of a different one of the N cache modules.
4. A shared memory computing device as claimed in any one of claims 1 to 3, in which at least one of the N cache modules is adapted to maintain coherency with regard to the data of the write memory transfer requests issued by one of the at least one interconnect masters to a memory address located in the shared memory.
5. A shared memory computing device as claimed in any one of claims 1 to 4, in which at least one of the N cache modules is a fully associative cache.
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2013902678 | 2013-07-18 | ||
AU2013902678A AU2013902678A0 (en) | 2013-07-18 | Computing architecture with peripherals | |
AU2013904532 | 2013-11-25 | ||
AU2013904532A AU2013904532A0 (en) | 2013-11-25 | Computing architecture with peripherals | |
PCT/IB2014/063189 WO2015008251A2 (en) | 2013-07-18 | 2014-07-17 | Computing architecture with peripherals |
Publications (2)
Publication Number | Publication Date |
---|---|
NZ716954A NZ716954A (en) | 2021-02-26 |
NZ716954B2 true NZ716954B2 (en) | 2021-05-27 |
Family
ID=
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10210117B2 (en) | Computing architecture with peripherals | |
US20240184446A1 (en) | Multi-processor bridge with cache allocate awareness | |
US7406086B2 (en) | Multiprocessor node controller circuit and method | |
US7818388B2 (en) | Data processing system, method and interconnect fabric supporting multiple planes of processing nodes | |
US8732398B2 (en) | Enhanced pipelining and multi-buffer architecture for level two cache controller to minimize hazard stalls and optimize performance | |
US6279084B1 (en) | Shadow commands to optimize sequencing of requests in a switch-based multi-processor system | |
CN1940904B (en) | Data processing system and method | |
US6249520B1 (en) | High-performance non-blocking switch with multiple channel ordering constraints | |
US20040024925A1 (en) | Computer system implementing synchronized broadcast using timestamps | |
US20010055277A1 (en) | Initiate flow control mechanism of a modular multiprocessor system | |
JP2000231536A (en) | Circuit having transaction scheduling of state base and its method | |
CN102375800A (en) | Multiprocessor system-on-a-chip for machine vision algorithms | |
JP2016503934A (en) | Context switching cache system and context switching method | |
US6877056B2 (en) | System with arbitration scheme supporting virtual address networks and having split ownership and access right coherence mechanism | |
US20070005909A1 (en) | Cache coherency sequencing implementation and adaptive LLC access priority control for CMP | |
Ang et al. | StarT-Voyager: A flexible platform for exploring scalable SMP issues | |
US7680971B2 (en) | Method and apparatus for granting processors access to a resource | |
US7882309B2 (en) | Method and apparatus for handling excess data during memory access | |
US20090006777A1 (en) | Apparatus for reducing cache latency while preserving cache bandwidth in a cache subsystem of a processor | |
US6145032A (en) | System for recirculation of communication transactions in data processing in the event of communication stall | |
US20080192761A1 (en) | Data processing system, method and interconnect fabric having an address-based launch governor | |
JP2002198987A (en) | Active port of transfer controller with hub and port | |
NZ716954B2 (en) | Computing architecture with peripherals | |
Exploring Scalable | CSAIL | |
Daya | SC²EPTON: high-performance and scalable, low-power and intelligent, ordered Mesh on-chip network |