GB2635733A

GB2635733A - Technique for performing custom operations on data in memory

Info

Publication number: GB2635733A
Application number: GB2317928.6A
Authority: GB
Inventors: Ola Harald Liljedahl Eric
Original assignee: ARM Ltd; Advanced Risc Machines Ltd
Current assignee: ARM Ltd
Priority date: 2023-11-23
Filing date: 2023-11-23
Publication date: 2025-05-28
Also published as: GB202317928D0; WO2025109301A1

Abstract

An apparatus (5) is provided that has decoder circuitry (10) to decode instructions of a first instruction set, wherein the decoder circuitry is responsive to instructions of the first instruction set to generate control signals, and processing circuitry responsive to the control signals to cause operations defined by the instructions to be performed. The decoding circuitry is arranged to be responsive to a program-specifying instruction (Figure 2) of the first instruction set that specifies a memory location operand and a program operand to issue control signals to the processing circuitry to cause a second instruction set processing unit (40) to be triggered to execute a program identified by the program operand in order to perform a sequence of operations defined by the program on data accessed at a location in memory identified by the memory location operand. The program comprises one or more instructions of the second instruction set defining operations supported by the second instruction set processing unit (Figure 6).

Description

TECHNIQUE FOR PERFORMING CUSTOM OPERATIONS ON DATA IN MEMORY

BACKGROUND

The present technique relates to the field of data processing.

Within a data processing system, processing circuitry may be provided to perform the operations defined by instructions of a given instruction set. Typically, the processing circuitry will have access to a set of registers in which source data used when performing the operations, and result data generated as a result of performing the operations, is temporarily stored. Load instructions can be executed to load data from memory into the registers, store instructions can be used to store data back from the registers to memory, and data processing instructions (such as may be used to define arithmetic operations or logical operations) provided by the given instruction set may then specify one or more registers containing the source data to be operated on, and/or the registers into which generated result data is to be stored.

In some instances, it may be desirable to perform operations on data held in memory, without needing to first load the data from memory into the set of registers. Dedicated instructions can be defined for this purpose, each such dedicated instruction defining a particular operation to be performed on data in memory. However, there is typically a significant constraint on encoding space within any given instruction set, and hence the number of different instructions that can be defined is limited. This will typically mean that only a relatively small number of dedicated instructions to operate on data in memory can be defined. It would be desirable to alleviate this constraint, so as to allow for a wider variety of operations to be performed on data held in memory.

SUMMARY

In accordance with a first example arrangement, there is provided an apparatus comprising: decoder circuitry to decode instructions of a first instruction set, wherein the decoder circuitry is responsive to instructions of the first instruction set to generate control signals; and processing circuitry responsive to the control signals to cause operations defined by the instructions to be performed; wherein: the decoding circuitry is arranged to be responsive to a program-specifying instruction of the first instruction set that specifies a memory location operand and a program operand to issue control signals to the processing circuitry to cause a second instruction set processing unit to be triggered to execute a program identified by the program operand in order to perform a sequence of operations defined by the program on data accessed at a location in memory identified by the memory location operand, the program comprising one or more instructions of the second instruction set defining operations supported by the second instruction set processing unit.

In accordance with another example arrangement, there is provided a method of operating an apparatus, comprising: employing decoder circuitry to decode instructions of a first instruction set, wherein the decoder circuitry is responsive to instructions of the first instruction set to generate control signals; employing processing circuitry to be responsive to the control signals to cause operations defined by the instructions to be performed; and arranging the decoding circuitry to issue, in response to a program-specifying instruction of the first instruction set that specifies a memory location operand and a program operand, control signals to the processing circuitry to cause a second instruction set processing unit to be triggered to execute a program identified by the program operand in order to perform a sequence of operations defined by the program on data accessed at a location in memory identified by the memory location operand; wherein the program comprises one or more instructions of the second instruction set defining operations supported by the second instruction set processing unit.

In accordance with a still further example arrangement, there is provided a computer program comprising instructions which, when executed by a host data processing apparatus, control the host data processing apparatus to provide an instruction execution environment for executing target program code, the computer program comprising: instruction decoding program logic to decode instructions of a first instruction set, wherein the instruction decoding program logic is responsive to instructions of the first instruction set to generate control signals; and data processing program logic responsive to the control signals to cause operations defined by the instructions to be performed; wherein: the instruction decoding program logic is arranged to be responsive to a program-specifying instruction of the first instruction set that specifies a memory location operand and a program operand to issue control signals to the data processing program logic to cause second instruction set processing program logic to be triggered to execute a program identified by the program operand in order to perform a sequence of operations defined by the program on data accessed at a location in memory identified by the memory location operand, the program comprising one or more instructions of the second instruction set defining operations supported by the second instruction set processing program logic. Such a computer program can be disposed in any known transitory computer-readable medium (such as wired or wireless transmission of code over a network) or non-transitory computer-readable medium such as semiconductor, magnetic disk, or optical disc.

In a yet further example arrangement, there is provided a computer-readable medium to store computer-readable code for fabrication of an apparatus in accordance with the first example arrangement discussed above. The computer-readable medium may be a transitory computer-readable medium or a non-transitory computer-readable medium.

BRIEF DESCRIPTION OF THE DRAWINGS

Further aspects, features and advantages of the present technique will be apparent from the following description of examples, which is to be read in conjunction with the accompanying drawings, in which:

Figure 1 is a block diagram schematically illustrating a data processing system in accordance with one example implementation;

Figure 2 schematically illustrates fields provided within a user-defined atomic instruction, in accordance with one example implementation;

Figure 3 is a block diagram illustrating more details of an atomic processing unit that may be provided in accordance with one example implementation;

Figure 4 is a flow diagram illustrating steps performed upon decoding a user-defined atomic instruction, in accordance with one example implementation;

Figure 5 illustrates in more detail the operation of the instruction buffer of the atomic processing unit of Figure 3, in accordance with one example implementation;

Figure 6 illustrates, by way of specific example, how the instructions of a program loaded into the instruction buffer may be decoded, in accordance with one example implementation;

Figure 7 is a block diagram schematically illustrating a data processing system in accordance with another example implementation; and

Figure 8 illustrates a simulator implementation.

DESCRIPTION OF EXAMPLES

In accordance with the techniques described herein, an apparatus is provided that has decoder circuitry for decoding instructions of a first instruction set, and in particular for generating control signals in response to decoding each instruction of the first instruction set, with those control signals being passed to other elements within the apparatus to trigger performance of the operations required by the decoded instructions. Further, the apparatus has processing circuitry that is responsive to the control signals to cause the operations defined by the instructions to be performed. By such an approach, a sequence of instructions of the first instruction set can be decoded by the decoder circuitry, with the processing circuitry then performing the operations required by that sequence of instructions.

In addition, in accordance with the techniques described herein, a program-specifying instruction is defined within the first instruction set. The program-specifying instruction specifies a memory location operand and a program operand, and the decoding circuitry is arranged in response to decoding such an instruction to issue control signals to the processing circuitry that cause a second instruction set processing unit to be triggered to execute a program identified by the program operand. In particular, the second instruction set processing unit will then perform a sequence of operations defined by the program on data accessed at a location in memory identified by the memory location operand. The program comprises one or more instructions of the second instruction set defining operations supported by the second instruction set processing unit.

By such an approach, a single instruction of the first instruction set can be used to implement a wide variety of custom operations on data held in memory. In particular, a program can be written using the instructions of the second instruction set, with that program being identified by one of the operands of the program-specifying instruction. The task of executing that program can then be offloaded to the second instruction set processing unit, which can then execute each of the instructions defined within the program in order to perform a sequence of operations on data accessed in memory.

Whilst the second instruction set could take a variety of forms, in one example implementation the second instruction set will take the form of a special-purpose instruction set containing a limited number of instructions that define a core set of operations that can be performed by the second instruction set processing unit. It is desirable to keep the program written using the instructions of the second instruction set relatively small and simple so that it will execute in a predictable manner in a finite period of time. By keeping the number of instructions provided by the second instruction set limited, it is possible to uniquely define the instructions with a relatively small encoding, hence enabling the entire program made up of multiple instructions of the second instruction set to be defined by a relatively small number of bits. It has been found that with such an approach a wide variety of useful sequences of operations can be defined by such a program, with execution of that program being triggered by the single program-specifying instruction of the first instruction set, thereby avoiding the need to separately encode multiple different instructions of the first instruction set to seek to define all of the operations that it would be desirable to perform on data held in memory.

There are various scenarios where it may be useful to perform operations on data held in memory rather than first loading that data into the registers accessible to the processing circuitry. For example, the techniques described herein could be used to implement a variety of different offload mechanisms where it is desirable to use the second instruction set processing unit to perform some in-memory computation operations. In one particular example implementation, the program-specifying instruction can be used to perform a sequence of operations atomically, which for example can be very useful in situations where multiple elements in a data processing system share access to data. The sequence of operations performed atomically may be referred to collectively as an atomic operation, and when an atomic operation is performed by a particular process within the system, no other process within the system is able to read or change state that is read or changed during the atomic operation. As a result, the atomic operation is effectively executed as a single step, and is a very useful type of operation in a variety of situations in data processing systems. An atomic operation will typically implement a read, modify and write sequence where one or more items of data are read from memory, modified, and then written back to the same memory location.

Accordingly, in one example implementation, the program-specifying instruction is an atomic instruction, and in particular is an atomic instruction that specifies a program operand whose instructions are to be executed in order to perform a corresponding sequence of operations atomically. In such an implementation, the second instruction set processing unit can be considered to be an atomic processing unit that is arranged to execute instructions of the second instruction set in order to perform a sequence of operations atomically. In accordance with such an implementation, hardware within the system can be used to ensure that the sequence of operations is performed atomically. For instance, the atomic processing unit may be coupled to a particular level of cache, and the cache coherency protocol employed within the system can be used to ensure that that cache has ownership of one or more cache lines worth of data that is to be processed during execution of the program by the atomic processing unit, thereby ensuring that no other entities within the system will read or modify the data whilst it is being processed by the atomic processing unit during execution of the program.
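Purely by way of illustration, the atomicity property described above can be mimicked in software. The following Python sketch is an analogy only: the class name AtomicUnit, the run method and the use of a lock are assumptions introduced here, with the lock standing in for the exclusive cache line ownership that the cache coherency protocol would provide in hardware.

```python
import threading


class AtomicUnit:
    """Toy software stand-in for an atomic processing unit (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory
        # The lock plays the role of exclusive ownership of the cache line.
        self._ownership = threading.Lock()

    def run(self, address, program):
        """Apply `program` (a function of the old value) as one indivisible step."""
        with self._ownership:
            old = self.memory[address]
            self.memory[address] = program(old)
            return old


def hammer(unit, address, repeats):
    # Each call is a complete read, modify and write sequence.
    for _ in range(repeats):
        unit.run(address, lambda value: value + 1)


memory = {0x100: 0}
unit = AtomicUnit(memory)
workers = [threading.Thread(target=hammer, args=(unit, 0x100, 1000)) for _ in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(memory[0x100])  # 4000: no update is lost
```

Because every read, modify and write sequence runs while ownership is held, no other thread can observe or alter the intermediate state, which is the behaviour the text attributes to an atomic operation.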

There are a variety of ways in which the program operand of the program-specifying instruction can be used to identify the program to be executed. For instance, the program operand could identify a location in memory containing the program and hence specify the program by reference to the location in memory containing that program. However, in one example implementation, the apparatus has a set of registers that is accessible to the processing circuitry, and the program-specifying instruction is configured to specify the program operand by identifying at least one register from the set of registers containing the program to be executed by the second instruction set processing unit. Hence, in such an implementation the program operand directly specifies the program, since the program is contained within the register or registers forming the program operand. This provides a particularly simple and efficient mechanism. In particular, by arranging the second instruction set to include only a relatively small set of instructions, tailored specifically to the type of operations that it may be desired for the second instruction set processing unit to perform, it is possible to construct a program containing multiple of those instructions where the overall program size is small enough to fit within one, or a few, registers. In one specific example implementation, 64-bit registers may be provided within the set of registers, and the entire program may be accommodated within a single 64-bit register, or within a couple of such 64-bit registers. As a result, when the processing circuitry triggers the second instruction set processing unit to execute the program, it can merely read the program from the relevant register or registers and pass that program as an input to the second instruction set processing unit.
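As a minimal sketch of how a short program might be held in a single 64-bit register, the following Python function packs a sequence of bit fields into one 64-bit word. The function name pack_program, the choice of placing the first instruction in the least significant bits, and the example field widths are all assumptions made for illustration; the text above does not fix any particular layout.

```python
REGISTER_BITS = 64


def pack_program(fields):
    """Pack (value, width) bit fields into one 64-bit word.

    The first field lands in the lowest bits (an assumed convention).
    Returns the packed word and the number of bits actually used.
    """
    word = 0
    position = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        if position + width > REGISTER_BITS:
            raise ValueError("program does not fit in a single 64-bit register")
        word |= value << position
        position += width
    return word, position


# Two hypothetical 4-bit instructions occupy only 8 of the 64 bits.
word, used = pack_program([(0b0001, 4), (0b0011, 4)])
print(hex(word), used)  # 0x31 8
```

A program that overflows 64 bits is rejected here; in the arrangement described above such a program would instead be spread over a couple of registers.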

Such an approach enables a great degree of customisability in the defining of a sequence of operations to be performed by the second instruction set processing unit. In particular, the user/programmer can create a suitable program using the limited instructions of the second instruction set, and that program can be loaded into a register (or a couple of registers if required) of the register set prior to the program-specifying instruction being executed, with that program-specifying instruction identifying the register or registers containing the program.

In one example implementation, in addition to the earlier mentioned memory location operand and program operand, the program-specifying instruction may further specify at least one input value operand identifying at least one input value to be used by the second instruction set processing unit when performing the sequence of operations defined by the program. Hence, when the processing circuitry is triggering the second instruction set processing unit to execute the program, it can, in addition to identifying the program to be executed, input to the second instruction set processing unit the one or more input values specified by the input value operand(s). Each input value operand may identify the associated input value in a variety of ways. For example, the program-specifying instruction may identify one or more of the input values as immediate values directly encoded within the instruction, or alternatively may identify one or more registers containing those input values. When the input values are specified with reference to registers, then, in much the same way as discussed earlier in relation to the program operand, the processing circuitry can be arranged, when triggering the second instruction set processing unit to execute the program, to access the relevant register or registers in order to obtain the one or more input values, and then to provide those input values to the second instruction set processing unit.
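The operand handling described above can be sketched as follows. The class ProgramSpecifyingInstruction, its field names and the function build_request are hypothetical names introduced only for this illustration. The sketch assumes that the memory location operand and program operand name registers, and that input values may come either from registers or from immediates.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProgramSpecifyingInstruction:
    address_reg: int                   # register holding the memory address
    program_reg: int                   # register holding the encoded program
    input_regs: Tuple[int, ...] = ()   # registers holding input values
    immediates: Tuple[int, ...] = ()   # input values encoded in the instruction


def build_request(instr, registers):
    """Gather what the processing circuitry would pass to the second unit."""
    return {
        "address": registers[instr.address_reg],
        "program": registers[instr.program_reg],
        "inputs": tuple(registers[r] for r in instr.input_regs) + instr.immediates,
    }


registers = {0: 0x8000, 1: 0x31, 2: 7}
instr = ProgramSpecifyingInstruction(address_reg=0, program_reg=1,
                                     input_regs=(2,), immediates=(5,))
print(build_request(instr, registers))
```

In this model the second instruction set processing unit never reads the register file itself; it receives the program and the input values as plain inputs.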

As mentioned earlier, in one example implementation the aim of the second instruction set is to provide a relatively limited set of instructions that define a range of useful operations that can be used to construct a variety of different programs that could be executed by the second instruction set processing unit. By providing only a relatively limited set of instructions, it is possible to achieve a very efficient encoding of the instructions and hence enable the entire program to be constructed using a relatively small number of bits, thus for example enabling the entire program to be stored within one or a couple of registers. In accordance with such an implementation, each instruction of the second instruction set may be arranged to be encoded using fewer bits than each instruction of the first instruction set.

In one example implementation, the second instruction set is a variable length instruction set such that the number of bits used to encode any given instruction of the second instruction set is dependent on the type of that given instruction. Hence, in one example implementation, an opcode portion of an instruction of the second instruction set can be analysed in order to identify the type of instruction, and once that has been determined the total number of bits defining the instruction is then known, and can be analysed accordingly. Such an approach provides a particularly efficient encoding, by enabling any redundant encoding space to be avoided, and thereby allowing the entire program to be defined in an efficient manner (thus for example enabling the entire program to be stored within a single register in some instances, and within only a couple of registers in other instances).

The second instruction set processing unit can be provided at a variety of locations within an apparatus. For instance, the apparatus may be arranged to be coupled to a memory system providing multiple levels of cache, and the second instruction set processing unit may be provided in association with one of those levels of cache. In one particular example implementation, the second instruction set processing unit is provided by the processing circuitry, and is arranged to perform the sequence of operations on data held in a level of cache accessible to the processing circuitry. Such an implementation may be referred to as a "near" implementation, since the data is held in a level of cache relatively closely coupled to the processing circuitry.

However, the techniques described herein are not limited to use in association with such a near implementation. For instance, in an alternative example implementation, the processing circuitry may be arranged to be coupled to the second instruction set processing unit via an interconnect, and to trigger the second instruction set processing unit to execute the program by asserting a request over the interconnect to cause the second instruction set processing unit to perform the sequence of operations on data held in a level of cache associated with the second instruction set processing unit. In this case, the second instruction set processing unit may be located further out in the system, in association with a level of cache more remote from the processing circuitry, for example a level of cache shared between multiple instances of processing circuitry within the system. Such an implementation may be referred to as a "far" implementation.

Furthermore, in some example implementations, multiple instances of the second instruction set processing unit may be provided, for example one instance closely coupled to an associated instance of the processing circuitry, and one instance further out in the system, and in such an implementation dynamic switching between near and far implementations may take place taking into account a variety of factors. For example, the amount of contention for access to the data by multiple instances of processing circuitry may be considered when deciding whether to execute the program using the near implementation or the far implementation.

When multiple instances of processing circuitry are regularly seeking to access the same data, it can be more efficient to hold that data in a level of cache shared by those instances of the processing circuitry, since this can reduce the level of snooping required when implementing a cache coherency protocol, and reduce the extent to which data is moved around between caches when compared with an approach where the data is held in caches more closely coupled to the instances of the processing circuitry. Hence, in situations where there is likely to be a reasonable amount of contention, use of the far implementation is likely to provide a higher aggregated update rate and thus better scalability.
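One conceivable way of taking contention into account when choosing between the near and far implementations is sketched below. The policy shown, namely counting the distinct requesters that have recently accessed the data and comparing that count against a threshold, is purely an assumption made for illustration, as are the names choose_placement and contention_threshold; the text above does not prescribe any particular decision rule.

```python
def choose_placement(recent_requesters, contention_threshold=2):
    """Return "far" when several distinct requesters touched the data recently.

    recent_requesters: identifiers of the processing circuitry instances that
    recently accessed the data (a hypothetical contention measure).
    """
    distinct = len(set(recent_requesters))
    return "far" if distinct >= contention_threshold else "near"


print(choose_placement(["core0", "core0"]))           # near
print(choose_placement(["core0", "core1", "core0"]))  # far
```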

The second instruction set processing unit can be configured in a variety of ways. However, in one example implementation the second instruction set processing unit comprises instruction buffer storage to receive the program identified by the program operand, further decoding circuitry to analyse the content of the instruction buffer storage in order to decode each instruction of the second instruction set provided by the program, and execution circuitry to perform the operations required by each instruction in response to control signals generated by the further decoding circuitry. Hence, the further decoding circuitry can be arranged to analyse the series of bits forming the program as received into the instruction buffer storage in order to decode each instruction of the program and send appropriate control signals to the execution circuitry to cause the required operations to be performed.

In one example implementation, the further decoding circuitry is arranged, for each instruction of the second instruction set provided by the program, to determine from opcode bits the type of instruction, and to determine a total number of bits defining that instruction in dependence on the determined type of instruction. As noted earlier, in one example implementation the second instruction set may be a variable length instruction set, and hence it is necessary to evaluate the opcode in order to determine the type of instruction before the total number of bits used to define that instruction can be determined. In one particular example implementation, the further decoding circuitry can be arranged to start analysing the contents of the instruction buffer storage beginning from a first end of the instruction buffer storage, so as to decode each instruction of the program sequentially until it has been determined that all of the instructions forming the program have been decoded and actioned.
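The opcode-first decoding of a variable length instruction set can be sketched as follows. The 3-bit opcode, the particular opcode values and the instruction lengths in the table are invented for this illustration and should not be read as the actual encoding; the sketch also assumes that decoding starts from the least significant end of the buffer.

```python
OPCODE_BITS = 3

# Hypothetical instruction types: total length in bits, opcode included.
LENGTH_BY_OPCODE = {
    0b001: 7,   # e.g. opcode plus two 2-bit register fields
    0b010: 13,  # e.g. opcode, 2-bit register field and 8-bit immediate
    0b011: 5,   # e.g. opcode plus one 2-bit register field
}


def split_instructions(program, count):
    """Split the first `count` instructions from `program`, lowest bits first.

    Returns (opcode, raw_bits) pairs. An opcode missing from the table
    raises KeyError, since this sketch defines only three types.
    """
    decoded = []
    for _ in range(count):
        opcode = program & ((1 << OPCODE_BITS) - 1)  # type is known first
        length = LENGTH_BY_OPCODE[opcode]            # length follows from type
        decoded.append((opcode, program & ((1 << length) - 1)))
        program >>= length                           # move to next instruction
    return decoded


# A 7-bit instruction (0b1101001) followed by a 5-bit one (0b10011).
program = 0b1101001 | (0b10011 << 7)
print(split_instructions(program, 2))  # [(1, 105), (3, 19)]
```

The position of the second instruction is not known until the first has been classified, which is why the opcode must be evaluated before the remaining bits.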

The instructions provided within the second instruction set can take a variety of forms.

In one example implementation, the second instruction set includes an end instruction that is used to identify when an end of the program has been reached, and the end instruction is encoded as a sequence of bits that all have a same predetermined bit value (i.e. all 0s or all 1s). This can significantly improve efficiency, since it can enable early termination of the program in situations where the program does not occupy the entirety of the bit storage space within the instruction buffer storage. In particular, once an end instruction has been detected, there is no need to evaluate any remaining bit positions within the instruction buffer storage.

In one particular example implementation, as each instruction is decoded by the further decoding circuitry, the second instruction set processing unit is arranged to extend the series of bits forming the program by appending a number of bits of the predetermined value. Such an approach can further improve efficiency when analysing the program within the second instruction set processing unit, since it allows an implicit end instruction to be incorporated within the series of bits. In particular, as instructions are decoded, and the series of bits is extended by appending a number of bits of the predetermined value, a point will be reached where the further decoding circuitry identifies a currently analysed sequence of bits as forming the earlier-mentioned end instruction, due to the fact that the end instruction is encoded as a sequence of bits that all have that predetermined value. Such an approach can for example be useful if there is not space in the registers used to contain the program to include an explicit end instruction. For instance, if it is desirable to constrain the program to fit within a single 64-bit register, then it can be useful if there is no need to explicitly use any of those bits to identify an end instruction, and the above approach enables this to be achieved, since the extending of the series of bits using the approach described above will result in an end instruction effectively being added as the instructions are decoded.

The actual number of bits appended as each instruction is decoded may be varied dependent on implementation. However, in one example implementation, the number of bits appended when a given instruction is decoded is equal to the total number of bits forming that given instruction.
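The implicit end instruction can be illustrated with a short sketch. To keep it brief, the sketch assumes fixed 8-bit instructions, an all-zeros end instruction and a 64-bit buffer decoded from its least significant end; these are simplifying assumptions, and the function name run is hypothetical. Consuming an instruction is modelled as a right shift, which in a fixed-width buffer appends as many zero bits at the far end as the instruction occupied.

```python
BUFFER_BITS = 64
INSTR_BITS = 8   # fixed width, assumed only to keep the sketch short
END = 0          # end instruction: all bits equal to the predetermined value 0


def run(buffer):
    """Return the instructions decoded before an end instruction is seen."""
    buffer &= (1 << BUFFER_BITS) - 1
    executed = []
    while True:
        instr = buffer & ((1 << INSTR_BITS) - 1)
        if instr == END:
            return executed
        executed.append(instr)
        # Consuming the instruction shifts INSTR_BITS zeros in at the far
        # end of the buffer, so an end instruction always eventually appears.
        buffer >>= INSTR_BITS


# A program filling all 64 bits, with no explicit end instruction.
print(run(0x0807060504030201))  # [1, 2, 3, 4, 5, 6, 7, 8]
# An explicit end instruction stops decoding early; 0x03 is never examined.
print(run(0x03000201))          # [1, 2]
```

The first call shows the implicit end instruction created by the appended zeros, and the second shows the early termination that an explicit end instruction permits.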

In one example implementation, the second instruction set includes a skip instruction that is used to cause one or more subsequent instructions in the program to be skipped when a condition defined by the skip instruction is met. This provides increased flexibility in defining the program that is to be performed, by allowing some of the operations defined within the program to only be performed under certain conditions. Further, by using a skip instruction that can only allow forward movement through the program, rather than a more general branch instruction that may allow backwards branching, this can avoid the potential for the program to take a non-deterministic period of time to complete, and indeed avoids the potential for the program to get into an infinite loop where it never completes.

There are various ways in which it is possible to maintain sufficient status information to enable an assessment to be made as to whether the condition defined by the skip instruction is met. In one example implementation, at least one instruction in the second instruction set is arranged, when executed, to cause one or more condition code flags to be set in dependence on a result generated by execution of that instruction. Hence, as instructions within the program are executed, one or more of them may update condition code flags, and the execution circuitry can then be arranged, when subsequently encountering an instance of the skip instruction, to determine whether the condition defined by that skip instruction is met with reference to a state of the one or more condition code flags.
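The interplay between flag-setting instructions and a forward-only skip instruction can be sketched with a small interpreter. The instruction names cmp, skip, mov and add, the Z and N flags and the condition names are all assumptions introduced for this illustration rather than the actual second instruction set. Because the program counter only ever moves forward, the loop executes at most one step per instruction and cannot run indefinitely.

```python
def execute(program, value, operand):
    """Run a list of tuple-encoded instructions and return the final value."""
    flags = {"Z": False, "N": False}
    pc = 0
    while pc < len(program):
        op = program[pc]
        pc += 1
        if op[0] == "cmp":        # set flags from value - operand
            diff = value - operand
            flags["Z"] = diff == 0
            flags["N"] = diff < 0
        elif op[0] == "skip":     # ("skip", condition, instructions_to_skip)
            _, cond, count = op
            met = {"eq": flags["Z"], "ne": not flags["Z"],
                   "lt": flags["N"], "ge": not flags["N"]}[cond]
            if met:
                pc += count       # forward movement only
        elif op[0] == "mov":
            value = operand
        elif op[0] == "add":
            value += operand
        else:
            raise ValueError(f"unknown instruction {op[0]!r}")
    return value


# A "maximum" program: keep the memory value unless the operand is larger.
maximum = [("cmp",), ("skip", "ge", 1), ("mov",)]
print(execute(maximum, 3, 7))  # 7
print(execute(maximum, 9, 7))  # 9
```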

There are various ways in which the condition may be defined by the skip instruction, but in one example implementation the condition is defined by a condition field within the encoding of the skip instruction. In implementations where the second instruction set also includes the earlier-mentioned end instruction, then in one example implementation the end instruction may be encoded with identical opcode bits as are used to identify the skip instruction, but with the condition field set to a reserved value not used to define a condition. This avoids the need to have a separate explicit encoding for the end instruction since if a decoding of the opcode bits identifies a skip instruction, but the condition field is set to a reserved value, that instruction will instead be treated by the further decoding circuitry as an end instruction. In addition, in certain implementations this approach supports the use of an implicit end instruction as discussed earlier, when both the opcode bits for a skip instruction and the reserved value of the condition field have the same predetermined bit values, for example all zeros (or indeed in other implementations all ones).
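The sharing of opcode bits between the skip instruction and the end instruction can be sketched as below. The 2-bit opcode, the 3-bit condition field, their positions and the choice of zero for both the skip opcode and the reserved condition value are assumptions made so that an all-zeros bit pattern decodes as an end instruction.

```python
SKIP_OPCODE = 0b00      # hypothetical opcode value for skip
RESERVED_COND = 0b000   # hypothetical condition value that defines no condition


def classify(bits):
    """Classify an instruction from its low 2 opcode bits and 3 condition bits."""
    opcode = bits & 0b11
    cond = (bits >> 2) & 0b111
    if opcode == SKIP_OPCODE:
        # Same opcode as skip; the reserved condition value marks the end.
        return "end" if cond == RESERVED_COND else ("skip", cond)
    return ("other", opcode)


print(classify(0b00000))  # end
print(classify(0b00100))  # ('skip', 1)
print(classify(0b00001))  # ('other', 1)
```

With this choice no separate opcode is spent on the end instruction, and a run of zero bits is recognised as the end of the program.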

In one example implementation the second instruction set processing unit comprises a plurality of internal registers. In one example implementation these internal registers are different registers to the set of registers mentioned earlier that are accessible to the processing circuitry, and instead are registers used solely by the second instruction set processing unit.

However, in other implementations, it may be possible for the processing circuitry and the second instruction set processing unit to share access to the same set of physical registers. For example, register renaming techniques may be used to map logical registers to physical registers, and such an approach could also be used to map the internal register specifiers of instructions of the second instruction set to physical registers within the set of registers accessible to the processing circuitry, with the second instruction set processing unit then accessing those registers.

When triggered to perform the sequence of operations defined by the program, the second instruction set processing unit may be arranged to store in a given internal register source data obtained from the location in memory identified by the memory location operand.

The entity responsible for obtaining that source data from the location in memory may take a variety of forms. For instance, where the memory location operand identifies a register of the set of registers accessible by the processing circuitry (or indeed in some implementations a stack pointer) as containing the address information needed to identify the location in memory, then in accordance with one example approach the processing circuitry may obtain that address information from the register (or stack pointer), and pass that address information to the second instruction set processing unit, with the second instruction set processing unit then reading the source data from that memory location and storing the read source data into the given internal register. However, in one particular example implementation, the processing circuitry obtains the address information from the relevant register or stack pointer, and then itself reads the source data from the required memory location, with that source data then being passed as an input to the second instruction set processing unit, such that the second instruction set processing unit then merely needs to store that provided source data in the given internal register.

The given internal register that is used to store the source data obtained from the identified location in memory can take a variety of forms. For example, the internal register used may be hardwired, and/or predefined, so that the same specific internal register is used to store that information each time a program is executed by the second instruction set processing unit.

Having stored the source data within the given internal register as noted above, then on completion of the sequence of operations the second instruction set processing unit may be arranged, at least when result data differs from the source data, to write the result data to the location in memory identified by the memory location operand. Hence, the sequence of operations defines a form of read, modify and write operation sequence where data is read from memory, one or more computations are performed on that data, and then the result data is written back to the same memory location. In one example implementation, the result data is written back to the memory location irrespective of whether it is different to the original source data. However, if desired, a check may be performed by the second instruction set processing unit to determine whether the result data is different to the source data, with the result data only being written back to the memory location if it does indeed differ from the source data.
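Purely by way of illustration, the read, modify and write sequence described above, including the optional check that suppresses the write when the result data equals the source data, could be modelled in software as follows. This is a minimal sketch and not the hardware implementation: the function name, the dictionary standing in for memory, and the `skip_unchanged_write` flag are all hypothetical, and an ordinary Python callable stands in for the program of the second instruction set.

```python
def read_modify_write(memory, address, program, skip_unchanged_write=False):
    """Software model of the read, modify and write sequence.

    `memory` is a dict standing in for the memory system (assumption), and
    `program` is any callable standing in for the program (assumption).
    Returns the result data and whether a write to memory took place.
    """
    source = memory[address]      # read: source data goes into the given internal register
    result = program(source)      # modify: the sequence of operations is performed
    wrote = False
    # write: unconditionally, or only when the result differs from the source
    if not skip_unchanged_write or result != source:
        memory[address] = result
        wrote = True
    return result, wrote
```

With `skip_unchanged_write` left at `False` the model behaves like the implementation that always writes back; setting it to `True` models the variant in which the comparison is performed first.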

In one example implementation, a different internal register to the earlier-mentioned given internal register could be used to store the result data generated during execution of the instructions of the program. However, in one particular example implementation, during performance of the sequence of operations the second instruction set processing unit is arranged to cause the result data to be maintained in the given internal register, hence overwriting the initial source data at least in the event that the result data differs from the initial source data. This can be achieved by using "destructive" instructions whose result overwrites the contents of a source operand, which can be beneficial in the current situation since less instruction encoding space is required to define a given instruction due to the fact that one item of operand information identifies both a location of source information and the location which will be used to store result information.

As mentioned earlier, in one example implementation the program-specifying instruction may specify at least one input value operand identifying at least one input value to be used by the second instruction set processing unit when performing the sequence of operations defined by the program. In such an implementation, the processing circuitry may then be arranged to provide the at least one input value to the second instruction set processing unit for storing in at least one internal register of the plurality of internal registers other than the earlier-mentioned given internal register. The second instruction set processing unit can then operate on such input values by accessing them in its own internal registers.

There are various ways in which the at least one input value operand may identify the at least one input value. In one example implementation, the program-specifying instruction is configured to specify the at least one input value operand by identifying at least one register accessible to the processing circuitry that contains the at least one input value. The processing circuitry can then be arranged to read the at least one input value from the at least one register for provision to the second instruction set processing unit. The second instruction set processing unit may then be arranged during performance of the sequence of operations to maintain at least one output value, and on completion of the sequence of operations may be arranged to provide the at least one output value to the processing circuitry for storing in the at least one register that contained the at least one input value. Hence, once the sequence of operations has been performed, then one or more registers accessible to the processing circuitry can be updated with the relevant output value or values, and that information may then for example be referred to by the processing circuitry to determine the outcome of the program executed in response to the program-specifying instruction.

In one example implementation, during performance of the sequence of operations the second instruction set processing unit may be arranged to cause the at least one output value to be maintained in the at least one internal register of the plurality of internal registers used to store the at least one input value. As with the earlier discussion of the result data being arranged to overwrite the source data within the given internal register, the above arrangement of using the same internal register to maintain an output value as was used to store the original corresponding input value can be achieved by using "destructive" instructions, which can allow for a more efficient encoding due to a reduction in the number of operands that need to be defined.

Particular example implementations will now be described with reference to the figures.

Figure 1 schematically illustrates an example of a data processing apparatus 5. It will be appreciated that this is simply a high-level representation of a subset of components of the apparatus, and the apparatus may include many other components not illustrated. The apparatus 5 comprises processing circuitry 15 for performing data processing in response to instructions decoded by an instruction decoder (also referred to herein as decoder circuitry) 10. The instruction decoder 10 decodes instructions fetched from an instruction cache 50 in order to generate control signals 12 for controlling the processing circuitry 15 to perform corresponding operations represented by the instructions. The processing circuitry 15 may include one or more execution units for performing operations on values stored in registers 20 to generate result values to be written back to the registers. For example, the execution units could include an arithmetic/logic unit (ALU) 25 for executing arithmetic operations or logical operations, a floating-point (FP) unit 34 for executing operations using floating-point operands, and indeed other units such as a vector processing unit (not shown) for performing vector operations on operands including multiple independent data elements. The processing circuitry may also include a load/store unit (which may also be referred to as a memory access unit) 35 for controlling transfer of data between the registers 20 and the memory system. In this example, the memory system includes the instruction cache 50, a level 1 data cache 45, a level 2 cache 55 shared between data and instructions, and main memory 60. It will be appreciated that other cache hierarchies are also possible; this is just one example.

It will be appreciated that other elements (not shown) may be provided in association with the load/store unit 35 for controlling accesses to memory. For example, a memory management unit (MMU) may be provided for providing address translation functionality to support memory accesses triggered by the load/store unit 35. The MMU may have a translation lookaside buffer (TLB) for caching a subset of entries from page tables stored in the memory system 45, 55, 60. Each page table entry may provide an address translation mapping for a corresponding page of addresses and may also specify access control parameters, such as access permissions specifying whether the page is a read only region or is both readable and writable, or access permissions specifying which privilege levels can access the page.

The instructions decoded by the decoder circuitry 10 are instructions from a first instruction set, and in accordance with the techniques described herein the first instruction set may include a program-specifying instruction. The program-specifying instruction specifies a memory location operand and a program operand, and in response to decoding such an instruction the decoder circuitry 10 issues control signals to the processing circuitry 15 to cause a second instruction set processing unit to be triggered to execute a program identified by the program operand. As a result, the second instruction set processing unit will then perform a sequence of operations defined by the program on data accessed at a location in memory identified by the memory location operand. The program comprises one or more instructions of a second instruction set defining operations supported by the second instruction set processing unit.

By such an approach, a single instruction of the first instruction set can be used to implement a wide variety of custom operations on data held in memory. In particular, a program can be written using the instructions of the second instruction set, with that program being identified by one of the operands of the program-specifying instruction. In one example implementation, the program will be constrained to be relatively small, and in particular to fit within one, or a couple, of the registers 20, and hence the program can be read from the relevant register or registers and passed as an input to the second instruction set processing unit. The task of executing that program can then be offloaded to the second instruction set processing unit, which can then execute each of the instructions defined within the program in order to perform a sequence of operations on data accessed in memory.

Another benefit of specifying the required sequence of operations via a program operand of a program-specifying instruction of the first instruction set is that no hardware state needs to be modified (saved and restored) during processes such as context switches. The proposed approach is also virtualisation friendly. All architected state is located in memory or in the registers referred to by instructions of the first instruction set, so will be handled by the operating system and hypervisor software.

The above described mechanism can be used to implement a variety of operations on data held in memory rather than first loading that data into the registers 20 accessible to the processing circuitry 15. In one particular example implementation, as will be discussed herein with reference to the remaining figures, the program-specifying instruction can be used to perform a sequence of operations atomically, such a sequence of operations also being referred to herein as an atomic operation. In such an implementation, as shown in Figure 1, the second instruction set processing unit can take the form of an atomic processing unit 40. Whilst the atomic processing unit can be located in a variety of places within the apparatus, in the example shown in Figure 1 the atomic processing unit 40 is shown as being provided within the load/store unit 35, and is arranged to perform the atomic operation in respect of data held in the level 1 data cache 45. Such an implementation may be referred to as a "near" implementation since the data operated on by the atomic processing unit is held in a level of cache closely coupled to the processing circuitry 15, in this case the level 1 data cache 45 provided in association with the processing circuitry 15.

Figure 2 is a diagram schematically illustrating fields that may be provided within the earlier mentioned program-specifying instruction. In the implementation described with reference to the figures, the program that is specified by such an instruction is used to perform the earlier mentioned atomic operation, and hence the program-specifying instruction may be referred to as a user-defined atomic instruction. The instruction is user-definable, since the programmer may write the program whose execution is to be triggered in response to decoding the user-defined atomic instruction. As mentioned earlier, the program will be written using instructions of the second instruction set, and in one example implementation the second instruction set will take the form of a special-purpose instruction set containing a limited number of instructions that define only a core set of operations, but where that core set of operations can be combined in a variety of different ways to enable a wide variety of useful atomic operations to be performed by the atomic processing unit 40.

The user-defined atomic instruction 100 includes an opcode field 105 that is used to define the instruction as being the user-defined atomic instruction, and hence can be used to distinguish this instruction from other instructions of the first instruction set. A program operand field 110 is then used to specify a program operand. The program operand may identify the program to be executed in a variety of ways, but as mentioned earlier in one example implementation the program will have been stored into one, or a couple, of registers 20 accessible to the processing circuitry 15 prior to execution of the user-defined atomic instruction, and hence the relevant register or registers can be identified by the program operand field 110.

A further field 115 is used to identify a memory location operand, this providing sufficient information to identify a location in memory whose data is going to be operated on by the program when the program is executed. The memory location operand can take a variety of forms, but in one example may identify a register 20 whose stored data value can be used to identify the memory location, for example by using that data value as an offset to add to a base address in order to identify the memory address to be accessed. In another example, the memory location operand may identify a stack pointer that is used to identify the memory address to access.

In one example implementation, in addition to the program operand field 110 and the memory location operand field 115, one or more additional fields 120 may be provided to specify one or more input value operands. Each input value operand may be used to identify an input value to be used by the atomic processing unit 40 when performing the sequence of operations defined by the program. Such input values can be specified in a variety of ways, for example by being specified as immediate values within the instruction, or by being specified with reference to one or more registers whose contents provide those input values.

In one example implementation, the user-defined atomic (UDA) instruction can take one of the following forms:

UDA <Ws>, <W(s+1)>, <W(s+2)>, <Xt>, [<Xn|SP>]

or

UDA <Xs>, <X(s+1)>, <X(s+2)>, <Xt>, [<Xn|SP>]

where up to three 32-bit (as indicated by the Ws, W(s+1) and W(s+2) labels) or three 64-bit (as indicated by the Xs, X(s+1) and X(s+2) labels) registers are used to specify input operands and output values, a 64-bit register Xt (or in some instances a register pair) is used to contain the user-defined program, and a 64-bit register Xn or the stack pointer SP is used to specify the targeted memory location.

More details of the atomic processing unit 40 illustrated in Figure 1 will now be described with reference to the block diagram of Figure 3. As discussed with reference to Figure 1, in that example implementation the atomic processing unit 40 is provided as part of the load/store unit 35, and is arranged to perform atomic operations on data held in the level 1 data cache 45 associated with the processing circuitry 15.

When the UDA instruction is decoded by the decoder circuitry 10 shown in Figure 1, this causes control signals to be issued to the processing circuitry 15. In response to those control signals, the processing circuitry 15 obtains the program from the relevant register or registers specified by the UDA instruction and forwards that program over path 175 to the atomic processing unit 40, where it is stored within the buffer 165 used by the decoder 150 of the atomic processing unit 40. Similarly, the processing circuitry 15 obtains the input operand data from any registers specified by the UDA instruction as providing input operands, and forwards that input operand data over path 180 to the atomic processing unit, where it is stored within one or more of the internal registers 155 provided within the atomic processing unit 40. In addition, the targeted memory location is identified with reference to the contents of the register Xn or the stack pointer SP, and the processing circuitry (in one example implementation the LSU 35) is then arranged to read the targeted memory location in order to retrieve the source data at that memory location, with that retrieved source data then being provided as another input to the atomic processing unit for storing within a specific internal register of the set of internal registers 155 within the atomic processing unit 40.

Thereafter, the decoder circuitry 150 is arranged to analyse the program as stored within the buffer 165 in order to identify each individual instruction of the second instruction set constituting the program. This process will be discussed in more detail later with reference to Figures 5 and 6. As the instructions are decoded by the decoder 150, control signals are issued to the execution circuitry 160 within the atomic processing unit 40 to cause the operations required by those instructions to be performed. During this process, the execution circuitry 160 will access the internal registers 155 as required in order to obtain input operands for the operations, and to store the output values generated during the performance of those operations. In one particular example implementation, the output values are stored back to the same registers that were used to provide the input operands, and any result data generated for storing back to memory is written back to the same internal register that was used to store the originally retrieved source data from memory.

At least some of the instructions in the second instruction set may be arranged to cause the values of condition code flags to be set in dependence on the outcome of the operations performed when executing those instructions, and one or more instructions may then be arranged to be conditionally executed, dependent on the values of one or more of the condition code flags at the time those instructions are encountered. The condition code flags can be saved within an internal storage 170 of the atomic processing unit 40 for reference by the decoder 150 and/or execution circuitry 160 during execution of the program.

Figure 4 is a flow diagram illustrating the steps taken when a UDA instruction is decoded. As discussed earlier, the UDA instruction is an instruction of the first instruction set, and will be decoded by the decoder circuitry 10 discussed earlier with reference to Figure 1. When such a UDA instruction is decoded at step 200, then at step 205 the registers containing the program, the memory location indication, and any input values are identified, and the processing circuitry 15 is triggered to read and provide that information to the atomic processing unit 40.

As indicated at step 210, the atomic processing unit stores any provided input values in internal registers, obtains the source data from the specified memory location (as discussed earlier, this could be done by the atomic processing unit itself, or by the LSU on the atomic processing unit's behalf) and stores that read source data in another internal register (typically there being a dedicated internal register for storing such source data). In addition, the program retrieved from the relevant register or registers of the set of registers 20 is stored in the instruction buffer 165 of the atomic processing unit 40.

At step 215, the atomic processing unit then decodes each instruction of the program, executing those instructions until the program has completed. During this process, the contents of the internal registers 155 will be updated with any result data and output values generated during the execution of the relevant instructions. At step 220, the atomic processing unit then writes the result data back to the originally specified memory location, and outputs as output values over path 185 (see Figure 3) the contents of the internal registers that were used to store the originally provided input values, so that those output values can be used to update the relevant CPU registers, i.e. the registers 20 that provided the input values.
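The flow of Figure 4 can be summarised by the following minimal software sketch. It is offered only as an illustration under stated assumptions: `execute_uda`, the list standing in for the registers 20, the dictionary standing in for memory, and the use of a Python callable in place of the program held in register Xt are all hypothetical and do not describe the actual circuitry.

```python
def execute_uda(cpu_regs, memory, s, n, program):
    """Model of steps 205 to 220 for a UDA instruction.

    cpu_regs : list standing in for the registers 20 (assumption)
    memory   : dict standing in for the memory system (assumption)
    s        : index of the first of three input/output registers
    n        : index of the register holding the target address
    program  : callable standing in for the program in Xt (assumption);
               it mutates the list of four internal registers in place
    """
    # Step 205: the processing circuitry reads the input values and address.
    inputs = [cpu_regs[s], cpu_regs[s + 1], cpu_regs[s + 2]]
    address = cpu_regs[n]

    # Step 210: the source data goes into a dedicated internal register
    # (index 0 here), and the input values into the remaining ones.
    internal = [memory[address]] + inputs

    # Step 215: the program is executed on the internal registers.
    program(internal)

    # Step 220: result data is written back to memory, and the output
    # values update the CPU registers that provided the inputs.
    memory[address] = internal[0]
    cpu_regs[s], cpu_regs[s + 1], cpu_regs[s + 2] = internal[1:4]


def fetch_and_add(r):
    """Example stand-in program: add r[1] to memory, return the old value in r[1]."""
    old = r[0]
    r[0] = r[0] + r[1]
    r[1] = old
```

For instance, with an increment in the first input register, `execute_uda(cpu_regs, memory, s, n, fetch_and_add)` leaves the incremented value in memory and the previous memory value in that same CPU register.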

The execution of the program by the atomic processing unit happens atomically with regards to memory, with the atomicity being enforced by the underlying hardware. For example, in a typical data processing system employing a cache hierarchy, with different processing units being able to share access to certain regions of memory, a cache coherency protocol can be used to ensure coherency in the data being accessed by the multiple processing units. When an atomic operation is to be performed in response to the earlier mentioned UDA instruction, then when the source data is accessed from the memory system, that source data can be placed within the level 1 data cache 45 and the associated processing circuitry 15 recorded as having exclusive access to that data. This exclusive state can be retained while all of the individual operations making up the atomic operation are performed, hence ensuring that no other processing unit can read or change the state of that data whilst the atomic operation is being performed.
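As a loose software analogy only, the effect of holding the data in an exclusive state for the duration of the atomic operation can be pictured with a lock that is held across the whole read, modify and write sequence. The sketch below is an assumption-laden illustration: the class `AtomicCell`, the helper `hammer`, and the use of `threading.Lock` in place of the cache coherency protocol are all hypothetical and are not how the hardware enforces atomicity.

```python
import threading


class AtomicCell:
    """One memory location whose updates are made atomic by a lock.

    The lock stands in for exclusive ownership of the cache line (assumption).
    """

    def __init__(self, value=0):
        self._value = value
        self._exclusive = threading.Lock()

    def run_atomically(self, program, *inputs):
        # No other thread can observe or change the value while the
        # whole sequence of operations is in progress.
        with self._exclusive:
            result, outputs = program(self._value, *inputs)
            self._value = result
            return outputs

    def load(self):
        with self._exclusive:
            return self._value


def fetch_and_add(value, increment):
    """Stand-in program: returns (new memory value, output value)."""
    return value + increment, value


def hammer(cell, threads=4, iterations=1000):
    """Run many concurrent updates and return the final value."""
    def worker():
        for _ in range(iterations):
            cell.run_atomically(fetch_and_add, 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.load()
```

Because each update completes as an indivisible unit, no increments are lost regardless of how the threads interleave.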

Figure 5 schematically illustrates how the instruction buffer 165 is used by the decoder of the atomic processing unit 40, in accordance with one example implementation. The program as read from the relevant register or registers 20 is forwarded to the atomic processing unit 40 for storing within the instruction buffer 165. The decoder then analyses the contents of the instruction buffer, starting from a first end identifying the first instruction of the program. In one example implementation, the second instruction set is a variable length instruction set, and as a result the number of bits used to define any given instruction is dependent on the type of that instruction. Hence, starting with the first instruction in the sequence, the decoder will analyse the opcode portion to identify the type of instruction, and having done that will then determine the total number of bits forming the instruction. Those bits can then be analysed in order to fully decode the instruction, for example to identify which of the internal registers contain source/input data to be used by that instruction, any condition that must be met in order for that instruction to be executed, etc. As each instruction is decoded, the bits forming that instruction are discarded, and a certain number of bits of a predetermined bit value are inserted into the instruction buffer. In one example implementation, the same number of bits are inserted as the number of bits forming the instruction that has just been decoded. The instruction buffer can be physically constructed in a variety of ways, but can be viewed as logically implementing a shift register, where each decoded instruction is shifted out of the register, and a corresponding number of bits of the predetermined bit value are shifted into the other end of the buffer.
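The logical shift register behaviour described above can be sketched as follows. This is an illustration only: the function name `consume` is hypothetical, a 64-bit buffer is assumed, the first instruction is assumed to occupy the most significant bits, and the predetermined bit value is assumed to be logic 0.

```python
WIDTH = 64                 # assumed buffer width in bits
MASK = (1 << WIDTH) - 1


def consume(buffer, nbits):
    """Shift one decoded instruction of `nbits` bits out of the buffer.

    Returns (instruction_bits, new_buffer). The instruction is taken from
    the most significant end (assumption), and the same number of zero
    bits is shifted in at the least significant end.
    """
    instruction = buffer >> (WIDTH - nbits)
    new_buffer = (buffer << nbits) & MASK
    return instruction, new_buffer
```

After each call the next instruction sits at the first end of the buffer, so the decoder can always inspect the same bit positions to find the next opcode.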

As shown schematically in Figure 5, the boundaries between the instructions forming the program are identified as each instruction of the program is decoded, Figure 5 showing the boundaries formed between three instructions 230, 235 and 240. The predetermined bit value inserted as each instruction is decoded can take a variety of forms, but in one example form is a logic 0 value. As will be discussed in more detail below, this can give rise to a variety of benefits, and in particular can enable an end instruction identifying the end of the program to be implicitly added into the sequence of instructions, which can be useful for example if there was insufficient room in the register or registers specifying the program to explicitly include an end instruction.

It will be appreciated that the exact instructions included within the second instruction set can vary dependent on implementation. Purely to illustrate a specific example, the second instruction set may include the following instructions:

1. MOV reg1,reg2 - move and set condition flags
2. ADD reg1,reg2 - add and set condition flags
3. SUB reg1,reg2 - subtract and set condition flags
4. CMP reg1,reg2 - compare and set condition flags
5. SET reg1,reg2 - set bits (bitwise OR) and set condition flags
6. AND reg1,reg2 - bitwise AND and set condition flags
7. EOR reg1,reg2 - bitwise exclusive-OR and set condition flags
8. SKIP cond,#count - conditionally skip instructions (when condition is met)
9. END (may be encoded as SKIP NEVER, especially if SKIP NEVER is encoded as 0b0000000)

In one example implementation, the ALU instructions (except CMP) can be arranged to be destructive and overwrite the first register operand, thus avoiding the need to separately specify a destination operand.

In one example implementation, the internal registers 155 may comprise four 32-bit or 64-bit registers, which may be referred to herein as registers R0 to R3. In one specific implementation, register R0 is initialised with the value obtained from the targeted memory location, and its value is then written back to that memory location when execution of the program has completed. The registers R1 to R3 may be initialised from the scalar CPU registers specified in the UDA instruction, with those scalar CPU registers then being updated with the values of the registers R1 to R3 when execution of the program has completed. In addition to the above registers, an implicit four-bit condition flags register may be provided to capture the condition code flags NZCV (N is set if the result of an operation is negative, Z is set if the result of an operation is zero, C is set if an operation produced a carry (or borrow on a subtraction) and V is set if an operation produced an overflow).
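To illustrate how the four condition code flags might be derived, the following minimal sketch computes NZCV for an addition of two register-width values. It is an assumption-based model rather than the execution circuitry 160 itself: the function name is hypothetical, only the ADD case is shown, and the carry behaviour for subtraction is deliberately not modelled here.

```python
def add_with_flags(a, b, width=32):
    """Return (result, (N, Z, C, V)) for a `width`-bit addition.

    N: result is negative (top bit set)
    Z: result is zero
    C: the addition produced a carry out of the top bit
    V: signed overflow (operands share a sign that the result does not)
    """
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask
    full = a + b
    result = full & mask
    n = bool(result & sign)
    z = result == 0
    c = full > mask
    v = bool(~(a ^ b) & (a ^ result) & sign)
    return result, (n, z, c, v)
```

For example, adding 1 to the all-ones 32-bit value wraps to zero, which sets Z and C while leaving N and V clear.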

With such a selection of instructions forming the second instruction set, and provision of the above-mentioned internal registers, the following efficient encoding can be utilised for the instructions:

1. 3 bits to encode the operation;
2. 2 bits to encode a register specifier;
3. 4 bits to encode a condition, for example the standard 15 Arm condition codes (NEVER condition reserved); and
4. 3 bits to encode a count (number of instructions to skip), allowing between 1 and 8 instructions to be skipped.

As a result, the instructions are either 7 bits (ALU instructions, comprising a 3-bit operation and two 2-bit register specifiers) or 10 bits (SKIP, comprising a 3-bit operation, a 4-bit condition and a 3-bit count) long, so a 64-bit program can hold up to 9 ALU instructions (63 bits), or up to 6 instructions if every instruction is a SKIP. As mentioned earlier, when an instruction has been decoded, the instruction stream (program) is shifted the corresponding number of bits (7 or 10) and in one example implementation an equivalent number of logic zero values are shifted in. Execution of the program ends when only zeroes remain in the instruction stream. As discussed earlier, in some implementations a final explicit END instruction may be omitted, for example if the end instruction is encoded as all zeros. In one particular implementation this is achieved by reusing the reserved encoding for the skip instruction. In particular, in one example implementation, the opcode for the skip instruction is "000" and the reserved condition code is "0000", so if an encoding of "0000000" is encountered, then rather than being interpreted as a skip instruction with the condition of "0000" the instruction is interpreted as an end instruction.
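The encoding arithmetic above can be checked with a short sketch that packs instructions into a 64-bit program word. This is illustrative only and rests on several assumptions: the helper names are hypothetical, the opcode is assumed to come first within each instruction followed by the remaining fields in the order listed, the count field is assumed to hold the skip count minus one, and instructions are assumed to be packed from the most significant end. Only the SKIP opcode "000" and the reserved condition "0000" are taken from the text.

```python
WIDTH = 64
SKIP_OPCODE = 0b000      # stated in the text
RESERVED_COND = 0b0000   # stated in the text; selects END instead of SKIP
END = (0b0000000, 7)     # explicit END: SKIP opcode plus reserved condition


def encode_alu(opcode, reg1, reg2):
    """Encode a 7-bit ALU instruction as (bits, length)."""
    if not 1 <= opcode <= 7:
        raise ValueError("ALU opcode must be a non-SKIP 3-bit value")
    if not (0 <= reg1 <= 3 and 0 <= reg2 <= 3):
        raise ValueError("register specifiers are 2 bits")
    return (opcode << 4) | (reg1 << 2) | reg2, 7


def encode_skip(cond, count):
    """Encode a 10-bit SKIP instruction as (bits, length)."""
    if not 1 <= cond <= 15:
        raise ValueError("condition 0b0000 is reserved")
    if not 1 <= count <= 8:
        raise ValueError("count must be between 1 and 8")
    return (SKIP_OPCODE << 7) | (cond << 3) | (count - 1), 10


def pack(instructions, width=WIDTH):
    """Pack (bits, length) pairs into one word; return (word, bits_used).

    Unused trailing bits are left as zeros, which the decoder would
    treat as an implicit END.
    """
    word, used = 0, 0
    for bits, length in instructions:
        if used + length > width:
            raise ValueError("program does not fit in the register")
        word = (word << length) | bits
        used += length
    return word << (width - used), used
```

Nine 7-bit instructions occupy 63 bits and therefore fit, leaving no room for an explicit END; this is the situation in which the implicit END provided by the shifted-in zeros is useful.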

The use of such an instruction set is illustrated schematically in Figure 6, where the instruction buffer 165 contains the sequence of bits 250, 260, 270, 280 followed by a series of logic zero values making a total of 64 bits. The three-bit opcode encodings of the instructions are indicated in the table 245. Hence, the first instruction sequence of bits 250 is interpreted as being a subtract instruction 255 with the registers R2 and R0 specified as operands. The next instruction sequence of bits 260 is interpreted as a skip instruction 265. The four condition code bits "cccc" can take any valid condition, and hence any sequence of four bits other than "0000" (which as discussed earlier is the reserved condition). In the particular example shown, it is assumed that the condition code specifies the "equal" condition, and the remaining three bits of the instruction identify that one instruction should be skipped if the condition is met.

The next instruction sequence of bits 270 identifies an add instruction 275, with the registers R0 and R1 specified as operands. The final instruction sequence of bits 280 is then determined to be the sequence "0000000", and hence as discussed earlier is interpreted as an end instruction 285, enabling the program to terminate early without the need to decode any of the remaining bits in the instruction buffer 165.
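The decoding and execution of the example program of Figure 6 can be modelled with the following minimal interpreter sketch. It is not the described hardware, and many details are assumptions made for illustration: all opcode values other than SKIP = "000", the condition values for "equal" and "not equal", the field order within an instruction, the count being stored as count minus one, instructions being packed from the most significant end, 64-bit registers, and the tracking of only the N and Z flags. The names `run`, `assemble` and `condition_holds` are hypothetical.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1

# Opcode values are hypothetical, except SKIP = 0b000 as stated in the text.
SKIP, MOV, ADD, SUB, CMP, SET, AND, EOR = range(8)
COND_EQ, COND_NE = 0b0001, 0b0010   # hypothetical; 0b0000 is reserved


def condition_holds(cond, flags):
    if cond == COND_EQ:
        return flags["Z"]
    if cond == COND_NE:
        return not flags["Z"]
    raise NotImplementedError("only EQ and NE are modelled in this sketch")


def run(program, regs):
    """Execute a 64-bit program word on the four internal registers R0..R3."""
    buf = program & MASK
    flags = {"N": False, "Z": False}
    to_skip = 0
    while True:
        opcode = buf >> (WIDTH - 3)
        if opcode == SKIP:
            cond = (buf >> (WIDTH - 7)) & 0xF
            if cond == 0:
                break                      # END, explicit or implicit zeros
            count = ((buf >> (WIDTH - 10)) & 0x7) + 1
            buf = (buf << 10) & MASK       # shift out 10 bits, zeros shift in
            if to_skip:
                to_skip -= 1               # this SKIP is itself being skipped
            elif condition_holds(cond, flags):
                to_skip = count
            continue
        r1 = (buf >> (WIDTH - 5)) & 0x3
        r2 = (buf >> (WIDTH - 7)) & 0x3
        buf = (buf << 7) & MASK            # shift out 7 bits, zeros shift in
        if to_skip:
            to_skip -= 1
            continue
        a, b = regs[r1], regs[r2]
        if opcode == MOV:
            result = b
        elif opcode == ADD:
            result = (a + b) & MASK
        elif opcode in (SUB, CMP):
            result = (a - b) & MASK
        elif opcode == SET:
            result = a | b
        elif opcode == AND:
            result = a & b
        else:                              # EOR
            result = a ^ b
        flags["Z"] = result == 0
        flags["N"] = bool(result >> (WIDTH - 1))
        if opcode != CMP:
            regs[r1] = result              # destructive: overwrite first operand
    return regs


def assemble(fields):
    """Pack (bits, length) pairs from the most significant end of the word."""
    word, used = 0, 0
    for bits, length in fields:
        word = (word << length) | bits
        used += length
    return word << (WIDTH - used)


# SUB R2,R0 ; SKIP EQ,#1 ; ADD R0,R1 ; END
FIGURE_6_PROGRAM = assemble([
    ((SUB << 4) | (2 << 2) | 0, 7),
    ((SKIP << 7) | (COND_EQ << 3) | 0, 10),
    ((ADD << 4) | (0 << 2) | 1, 7),
    (0, 7),
])
```

Under these assumptions, when R2 equals R0 the subtraction sets the Z flag, the SKIP takes effect and the ADD is not executed, so R0 (and hence the memory location) is left unchanged; otherwise R1 is added to R0.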

As mentioned earlier, the example arrangement of Figure 1 shows a near implementation of the atomic processing unit 40. However, the techniques described herein are not limited to use in association with such a near implementation. For example, in an alternative example implementation, the processing circuitry may be arranged to be coupled to the atomic processing unit via an interconnect, and to trigger the atomic processing unit to execute the program by asserting a request over the interconnect to cause the atomic processing unit to perform the sequence of operations defined by the program on data held in a level of cache associated with the atomic processing unit. In this case, the atomic processing unit may be located further out in the system, in association with a level of cache more remote from the processing circuitry, for example a level of cache shared between multiple instances of processing circuitry within the system. Such an implementation may be referred to as a "far" implementation, and an example of such an arrangement is shown in Figure 7.

In this example, three processing units 300, 305, 310 are provided, each with their own LSU 325, 330, 335, respectively, and each having its own associated level 1 data cache 327, 332, 337, respectively. Each of the processing units is connected to main memory 320 via a system interconnect 315, and the system interconnect may have one or more further levels of cache associated therewith, to cache data that can then be accessed by any of the processing units. In the example shown, there is a system cache 345, and access to that system cache is controlled by an entity 340 referred to herein as the home node. The home node can be arranged to implement a cache coherency protocol in order to ensure that a coherent view of data is maintained within the system, and as will be appreciated by those of ordinary skill in the art the home node may hence be arranged to issue snoop requests to the level 1 caches 327, 332, 337 when implementing the cache coherency protocol to check what data is cached by any local level 1 cache, and to cause coherency actions to be taken in dependence on those checks. For example, the processor 300 may wish to load a cache line's worth of data into its level 1 data cache 327 so that it can then operate on that data. A request for that data may be issued onto the system interconnect 315, causing the home node 340 to instigate some cache coherency actions to ensure that the most up-to-date version of the data is retrieved into the level 1 data cache 327, and any other processing unit's locally cached copy is either invalidated or marked to identify that a coherency check should be made if an attempt is made by any of those processing units to access that data, to ensure that at that time the most up-to-date version of the data is retrieved.

In accordance with the example shown in Figure 7, an atomic processing unit 350 may be associated with the home node 340, for performing atomic operations on data held in the associated system cache 345. In some instances, it may be more efficient to operate on the data in the system cache 345, since this can reduce the level of snooping required when implementing the cache coherency protocol, and reduce the extent to which data is moved around between caches when compared with an approach where that data is held in a cache closely coupled to one of the processing units. Operating on the data in the system cache 345 may hence result in an aggregated update rate (under contention) being higher.

When adopting such a far implementation, if the decoder in one of the processing units 300, 305, 310 decodes the earlier mentioned UDA instruction, then the required program data, memory location information and any user operand data can be gathered from the relevant registers and then output within a request issued over the system interconnect 315 to the atomic processing unit 350. The atomic processing unit 350 can then execute the program in the same way as described earlier with reference to the atomic processing unit 40, but in this instance may operate on source data as held in the system cache 345, and any generated result data can then be written back to that system cache to overwrite the source data.
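
As a rough illustration of the request flow just described, the following Python sketch models a processing unit packaging the program, memory location and user operands into a request, and a unit associated with the home node applying that program to data held in the system cache. The names `FarRequest`, `HomeNodeModel`, `handle` and `add_unless` are hypothetical. The program is represented by a Python callable rather than by encoded bits of the second instruction set, and a lock merely stands in for whatever mechanism the hardware uses to make the sequence atomic.

```python
from dataclasses import dataclass
from threading import Lock
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an encoded program: takes the source data and the
# user operands, and returns the result data and the output values.
Program = Callable[[int, List[int]], Tuple[int, List[int]]]


@dataclass
class FarRequest:
    """Assumed request contents: memory location, program, user operands."""
    address: int
    program: Program
    operands: List[int]


class HomeNodeModel:
    """Toy model of a home node with a system cache and an attached unit."""

    def __init__(self) -> None:
        self.system_cache: Dict[int, int] = {}
        self._lock = Lock()  # models atomicity of the whole sequence

    def handle(self, request: FarRequest) -> List[int]:
        with self._lock:
            source = self.system_cache.get(request.address, 0)
            result, outputs = request.program(source, list(request.operands))
            # Result data overwrites the source data in the system cache.
            self.system_cache[request.address] = result
            return outputs


def add_unless(v: int, operands: List[int]) -> Tuple[int, List[int]]:
    """Example program body: add a to v unless v equals u."""
    a, u = operands
    if v != u:
        return v + a, [a, u - v]
    return v, [a, 0]


node = HomeNodeModel()
node.system_cache[0x40] = 5
outputs = node.handle(FarRequest(address=0x40, program=add_unless, operands=[3, 7]))
```

In this sketch the only data crossing the modelled interconnect is the request and the returned output values; the cached data itself stays in `system_cache`.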

In some example implementations, the system may provide both near and far implementations of the atomic processing unit, with some atomic operations being performed using the near implementation and some atomic operations being performed using the far implementation. A variety of factors may be taken into consideration when deciding whether to use the near or far implementation for any particular atomic operation. For example, the amount of contention for access to the source data by multiple instances of the processing units may be taken into account when deciding whether to execute the program using the near implementation or the far implementation.

The techniques described herein can be used to enable a wide variety of operations to be performed atomically. Purely to illustrate some example use cases, three examples of programs that could be written using the specific second instruction set discussed earlier, with those programs then being specified as a program operand of the UDA instruction, will now be discussed.

Firstly, an atomic add unless function of the following form is considered:

    int atomic_add_unless(v, a, u)
    {
        if (v != u) {
            v += a;
            return Non-zero;
        } else { //v == u
            return 0;
        }
    }

When using the techniques described herein the following inputs and outputs could be specified:

Inputs:
    R0  v    Memory operand read from the location
    R1  a    Register input operand
    R2  u    Register input operand
    R3       Register input operand (not used)

Outputs:
    R0  v    Memory operand written to the location
    R1  a    Register output value (unchanged)
    R2  res  Register output value
    R3       Register output value (unchanged)

The following program could then be written using the earlier-discussed second instruction set to perform the above function:

    //Subtract v from u
    SUB R2,R0 (7 bits)
    //Skip 1 instruction if diff == 0
    SKIP EQ,#1 (10 bits)
    //v != u (R2 != 0), add a to v
    ADD R0,R1 (7 bits)
    //Else v == u (R2 == 0)
    //End program (fill remaining bits with zero)
    END

The resultant program size is hence 24 bits, which can readily be accommodated within a single register.
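
To make the control flow of this first program easier to follow, the sketch below steps through it with a very small Python interpreter. It is an illustrative model only: the instruction semantics (the first operand is both source and destination, and SKIP tests the result of the most recent arithmetic instruction) are assumed from the comments in the listing, the bit-level encoding is not modelled, and register width and wraparound are ignored. The names `run` and `ATOMIC_ADD_UNLESS` are hypothetical.

```python
# Assumed semantics, inferred from the listing's comments:
#   SUB Rd,Rs -> Rd = Rd - Rs      ADD Rd,Rs -> Rd = Rd + Rs
#   SKIP EQ,#n -> skip the next n instructions if the last result was zero
#   END -> stop
def run(program, regs):
    """Execute (mnemonic, operand, operand) tuples on registers R0..R3."""
    regs = list(regs)
    last = 0  # result of the most recent arithmetic instruction
    pc = 0
    while pc < len(program):
        op, x, y = program[pc]
        pc += 1
        if op == "END":
            break
        if op == "SUB":
            regs[x] -= regs[y]
            last = regs[x]
        elif op == "ADD":
            regs[x] += regs[y]
            last = regs[x]
        elif op == "SKIP":
            if x == "EQ" and last == 0:
                pc += y
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return regs


ATOMIC_ADD_UNLESS = [
    ("SUB", 2, 0),       # R2 = u - v
    ("SKIP", "EQ", 1),   # v == u: skip the ADD
    ("ADD", 0, 1),       # v != u: v += a
    ("END", None, None),
]

# R0 = v (from memory), R1 = a, R2 = u, R3 unused
changed = run(ATOMIC_ADD_UNLESS, [5, 3, 7, 0])    # v != u
unchanged = run(ATOMIC_ADD_UNLESS, [7, 3, 7, 0])  # v == u
```

With these inputs, `changed` is `[8, 3, 2, 0]` (R0 updated, R2 non-zero) and `unchanged` is `[7, 3, 0, 0]` (R0 untouched, R2 zero), matching the non-zero/zero return values of the function form.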

As a second example, an atomic ring buffer acquire function may be considered, such a function often being used by enqueue and dequeue operations. The following inputs and outputs could be specified:

Inputs:
    R0  Tail       Memory operand read from the location
    R1  HeadCap    Register input operand: Head index + ring buffer capacity (support both enqueue and dequeue)
    R2  Requested  Register input operand
    R3  Minimum    Register input operand

Outputs:
    R0  Tail       Memory operand written to the location
    R1  Actual     Register output value: software must compare returned Actual with Minimum and find out if operation succeeded (Actual>=Minimum)
    R2  Requested  Register output value (unchanged)
    R3  Minimum    Register output value (unchanged)

The following program could then be written using the earlier-discussed second instruction set to perform the atomic ring buffer acquire function:

    //Subtract Tail:R0 from HeadCap:R1, store into Actual:R1
    SUB R1,R0 (7 bits)
    //Compare Actual with Requested
    //Use Requested if Requested < Actual
    CMP R1,R2 (7 bits)
    //Skip 1 instruction if diff > 0
    SKIP GT,#1 (10 bits)
    MOV R1,R2 (7 bits)
    //Compare Actual with Minimum
    CMP R1,R3 (7 bits)
    //Bail out if Actual<Minimum
    //Skip 1 instruction if diff < 0
    SKIP LT,#1 (10 bits)
    //Add actual to tail
    ADD R0,R1 (7 bits)
    //End program (fill remaining bits with zero)
    END

The resultant program size is hence 55 bits.
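
The behaviour described by the comments in the ring buffer listing can be sketched in Python as follows. This is a reference model of the intended result only: it follows the comments (use Requested when it is smaller than the available count, then advance Tail only if at least Minimum can be granted), it does not model the instruction encoding or the flag semantics of CMP and SKIP, and it ignores index wraparound. The function name `ring_acquire` is hypothetical.

```python
def ring_acquire(tail, head_cap, requested, minimum):
    """Return (new_tail, actual) as described by the listing's comments."""
    # "Subtract Tail from HeadCap, store into Actual"
    actual = head_cap - tail
    # "Use Requested if Requested < Actual"
    if requested < actual:
        actual = requested
    # "Bail out if Actual < Minimum", otherwise "Add actual to tail"
    if actual >= minimum:
        tail += actual
    return tail, actual


# Enough entries available: the full request is granted and Tail advances.
granted = ring_acquire(tail=10, head_cap=18, requested=4, minimum=1)

# Too few entries available: Tail is unchanged and Actual < Minimum.
refused = ring_acquire(tail=10, head_cap=12, requested=4, minimum=3)
```

As the outputs table notes, the caller decides success by comparing the returned Actual with Minimum: here `granted` is `(14, 4)` and `refused` is `(10, 2)`.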

As a third example, a Golang runtime function of the following form could be used for garbage collection (see https://github.com/golang/go/blob/5b72f45dd17314af39627c2fcac0fbc099b67603/src/runtime/mgcsweep.go#L169-L184):

    for {
        state := a.state.Load()
        if (state&^sweepDrainedMask)-1 >= sweepDrainedMask {
            throw("mismatched begin/end of activeSweep")
        }
        if a.state.CompareAndSwap(state, state-1) {
            return
        }
    }

The following inputs and outputs could be specified:

Inputs:
    R0  state              Memory operand read from the location
    R1  ~sweepDrainedMask  Register input operand
    R2  const0             Register input operand
    R3  sweepDrainedMask   Register input operand

Outputs:
    R0  state              Memory operand written to the location
    R1  result             Register output value: (state&^sweepDrainedMask)-1
    R2  const0             Register output value (unchanged)
    R3  sweepDrainedMask   Register output value (unchanged)

The following program could then be written using the earlier-discussed second instruction set to perform the above function:

    //R1 = R0 & ~sweepDrainedMask
    AND R1,R0 (7 bits)
    //R1 -= 1
    SUB R1,R2 (7 bits)
    //Compare R1 against sweepDrainedMask
    CMP R1,R3 (7 bits)
    //If R1 >= sweepDrainedMask then skip update
    SKIP GE,#1 (10 bits)
    //R0 -= 1
    SUB R0,R2 (7 bits)
    //End program (fill remaining bits with zero)
    END

The resultant program size is hence 38 bits.
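
The three program sizes quoted above follow directly from the per-instruction bit counts annotated in the listings: 7 bits for each two-register instruction and 10 bits for each SKIP. END is not counted because it is formed by the zero fill of the remaining bits. The short Python check below reproduces the arithmetic; the dictionary and function names are hypothetical, and the widths are simply those shown in the listings.

```python
# Bit widths as annotated in the three example listings.
BITS = {"SUB": 7, "ADD": 7, "AND": 7, "CMP": 7, "MOV": 7, "SKIP": 10}

# Mnemonics of each example program, in order, excluding the END zero fill.
PROGRAMS = {
    "atomic_add_unless": ["SUB", "SKIP", "ADD"],
    "ring_buffer_acquire": ["SUB", "CMP", "SKIP", "MOV", "CMP", "SKIP", "ADD"],
    "sweep_end": ["AND", "SUB", "CMP", "SKIP", "SUB"],
}


def program_bits(mnemonics):
    """Sum the encoded width of each instruction in the program."""
    return sum(BITS[m] for m in mnemonics)


sizes = {name: program_bits(ops) for name, ops in PROGRAMS.items()}
for name, bits in sizes.items():
    print(f"{name}: {bits} bits")
```

The computed sizes are 24, 55 and 38 bits respectively, in agreement with the figures given for the three examples.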

Concepts described herein may be embodied in computer-readable code for fabrication of an apparatus that embodies the described concepts. For example, the computer-readable code can be used at one or more stages of a semiconductor design and fabrication process, including an electronic design automation (EDA) stage, to fabricate an integrated circuit comprising the apparatus embodying the concepts. The above computer-readable code may additionally or alternatively enable the definition, modelling, simulation, verification and/or testing of an apparatus embodying the concepts described herein.

For example, the computer-readable code for fabrication of an apparatus embodying the concepts described herein can be embodied in code defining a hardware description language (HDL) representation of the concepts. For example, the code may define a register-transfer-level (RTL) abstraction of one or more logic circuits for defining an apparatus embodying the concepts. The code may define a HDL representation of the one or more logic circuits embodying the apparatus in Verilog, SystemVerilog, Chisel, or VHDL (Very High-Speed Integrated Circuit Hardware Description Language) as well as intermediate representations such as FIRRTL. Computer-readable code may provide definitions embodying the concept using system-level modelling languages such as SystemC and SystemVerilog or other behavioural representations of the concepts that can be interpreted by a computer to enable simulation, functional and/or formal verification, and testing of the concepts.

Additionally or alternatively, the computer-readable code may define a low-level description of integrated circuit components that embody concepts described herein, such as one or more netlists or integrated circuit layout definitions, including representations such as GDSII. The one or more netlists or other computer-readable representation of integrated circuit components may be generated by applying one or more logic synthesis processes to an RTL representation to generate definitions for use in fabrication of an apparatus embodying the invention. Alternatively or additionally, the one or more logic synthesis processes can generate from the computer-readable code a bitstream to be loaded into a field programmable gate array (FPGA) to configure the FPGA to embody the described concepts. The FPGA may be deployed for the purposes of verification and test of the concepts prior to fabrication in an integrated circuit or the FPGA may be deployed in a product directly.

The computer-readable code may comprise a mix of code representations for fabrication of an apparatus, for example including a mix of one or more of an RTL representation, a netlist representation, or another computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus embodying the invention. Alternatively or additionally, the concept may be defined in a combination of a computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus and computer-readable code defining instructions which are to be executed by the defined apparatus once fabricated.

Such computer-readable code can be disposed in any known transitory computer-readable medium (such as wired or wireless transmission of code over a network) or non-transitory computer-readable medium such as semiconductor, magnetic disk, or optical disc. An integrated circuit fabricated using the computer-readable code may comprise components such as one or more of a central processing unit, graphics processing unit, neural processing unit, digital signal processor or other components that individually or collectively embody the concept.

Figure 8 illustrates a simulator implementation that may be used. Whilst the earlier described embodiments implement the present invention in terms of apparatus and methods for operating specific processing hardware supporting the techniques concerned, it is also possible to provide an instruction execution environment in accordance with the embodiments described herein which is implemented through the use of a computer program. Such computer programs are often referred to as simulators, insofar as they provide a software based implementation of a hardware architecture. Varieties of simulator computer programs include emulators, virtual machines, models, and binary translators, including dynamic binary translators.

Typically, a simulator implementation may run on a host processor 430, optionally running a host operating system 420, supporting the simulator program 410. In some arrangements, there may be multiple layers of simulation between the hardware and the provided instruction execution environment, and/or multiple distinct instruction execution environments provided on the same host processor. Historically, powerful processors have been required to provide simulator implementations which execute at a reasonable speed, but such an approach may be justified in certain circumstances, such as when there is a desire to run code native to another processor for compatibility or re-use reasons. For example, the simulator implementation may provide an instruction execution environment with additional functionality which is not supported by the host processor hardware, or provide an instruction execution environment typically associated with a different hardware architecture. An overview of simulation is given in "Some Efficient Architecture Simulation Techniques", Robert Bedichek, Winter 1990 USENIX Conference, Pages 53-63.

To the extent that embodiments have previously been described with reference to particular hardware constructs or features, in a simulated embodiment, equivalent functionality may be provided by suitable software constructs or features. For example, particular circuitry may be implemented in a simulated embodiment as computer program logic. Similarly, memory hardware, such as a register or cache, may be implemented in a simulated embodiment as a software data structure. In arrangements where one or more of the hardware elements referenced in the previously described embodiments are present on the host hardware (for example, host processor 430), some simulated embodiments may make use of the host hardware, where suitable.

For example, the simulator code 410 may include instruction decoding program logic 412 to decode instructions in the target code 400, and hence the instruction decoding program logic may emulate the instruction decoder 10 described earlier. The simulator code 410 may also include register emulating program logic 414 to emulate the registers 20 described above. The simulator program also includes data processing program logic 416 to process instructions in the target code 400 (and hence emulate processing circuitry 15) as well as atomic processing program logic 418 to emulate the atomic processing unit 40.

The simulator program 410 may be stored on a computer-readable storage medium (which may be a non-transitory medium), and provides a program interface (instruction execution environment) to the target code 400 (which may include applications, operating systems and a hypervisor) which is the same as the interface of the hardware architecture being modelled by the simulator program 410. Thus, the program instructions of the target code 400, including the program-specifying instruction (for example UDA instruction) described above, may be executed from within the instruction execution environment using the simulator program 410, so that a host computer 430 which does not actually have the hardware features of the apparatus 5 discussed above can emulate these features.

Accordingly, the simulator code 410 is an example of a computer program comprising instructions which, when executed by a host data processing apparatus, control the host data processing apparatus to provide an instruction execution environment for executing target program code, the computer program comprising: instruction decoding program logic to decode instructions of a first instruction set, wherein the instruction decoding program logic is responsive to instructions of the first instruction set to generate control signals; and data processing program logic responsive to the control signals to cause operations defined by the instructions to be performed; wherein: the instruction decoding program logic is arranged to be responsive to a program-specifying instruction of the first instruction set that specifies a memory location operand and a program operand to issue control signals to the data processing program logic to cause second instruction set processing program logic to be triggered to execute a program identified by the program operand in order to perform a sequence of operations defined by the program on data accessed at a location in memory identified by the memory location operand, the program comprising one or more instructions of the second instruction set defining operations supported by the second instruction set processing program logic.

In the present application, the words "configured to..." are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a "configuration" means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. "Configured to" does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.

In the present application, lists of features preceded with the phrase "at least one of" mean that any one or more of those features can be provided either individually or in combination. For example, "at least one of: [A], [B] and [C]" encompasses any of the following options: A alone (without B or C), B alone (without A or C), C alone (without A or B), A and B in combination (without C), A and C in combination (without B), B and C in combination (without A), or A, B and C in combination.

Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims.

Claims

1. An apparatus comprising: decoder circuitry to decode instructions of a first instruction set, wherein the decoder circuitry is responsive to instructions of the first instruction set to generate control signals; and processing circuitry responsive to the control signals to cause operations defined by the instructions to be performed; wherein: the decoding circuitry is arranged to be responsive to a program-specifying instruction of the first instruction set that specifies a memory location operand and a program operand to issue control signals to the processing circuitry to cause a second instruction set processing unit to be triggered to execute a program identified by the program operand in order to perform a sequence of operations defined by the program on data accessed at a location in memory identified by the memory location operand, the program comprising one or more instructions of the second instruction set defining operations supported by the second instruction set processing unit.
2. An apparatus as claimed in Claim 1, wherein the program-specifying instruction is an atomic instruction, the second instruction set processing unit is an atomic processing unit, and the sequence of operations are arranged to be performed atomically.
3. An apparatus as claimed in Claim 1 or Claim 2, further comprising: a set of registers accessible to the processing circuitry; and the program-specifying instruction is configured to specify the program operand by identifying at least one register from the set of registers containing the program to be executed by the second instruction set processing unit.
4. An apparatus as claimed in any preceding claim, wherein the program-specifying instruction further specifies at least one input value operand identifying at least one input value to be used by the second instruction set processing unit when performing the sequence of operations defined by the program.
5. An apparatus as claimed in any preceding claim, wherein each instruction of the second instruction set is encoded using less bits than each instruction of the first instruction set.
6. An apparatus as claimed in any preceding claim, wherein the second instruction set is a variable length instruction set such that a number of bits used to encode any given instruction of the second instruction set is dependent on a type of that given instruction.
7. An apparatus as claimed in any preceding claim, wherein: the apparatus is arranged to be coupled to a memory system providing multiple levels of cache; the second instruction set processing unit is provided by the processing circuitry; and the second instruction set processing unit is arranged to perform the sequence of operations on data held in a level of cache accessible to the processing circuitry.
8. An apparatus as claimed in any of claims 1 to 6, wherein: the apparatus is arranged to be coupled to a memory system providing multiple levels of cache; the processing circuitry is arranged to be coupled to the second instruction set processing unit via an interconnect, and to trigger the second instruction set processing unit to execute the program by asserting a request over the interconnect to cause the second instruction set processing unit to perform the sequence of operations on data held in a level of cache associated with the second instruction set processing unit.
9. An apparatus as claimed in any preceding claim, wherein the second instruction set processing unit comprises: instruction buffer storage to receive the program identified by the program operand; further decoding circuitry to analyse the content of the instruction buffer storage in order to decode each instruction of the second instruction set provided by the program; and execution circuitry to perform the operations required by each instruction in response to control signals generated by the further decoding circuitry.
10. An apparatus as claimed in Claim 9, wherein the program is formed of a series of bits, and the further decoding circuitry is arranged, for each instruction of the second instruction set provided by the program, to determine from opcode bits the type of instruction, and to determine a total number of bits defining that instruction in dependence on the determined type of instruction.
11. An apparatus as claimed in Claim 10, wherein the second instruction set includes an end instruction that is used to identify when an end of the program has been reached, and the end instruction is encoded as a sequence of bits that all have a same predetermined bit value.
12. An apparatus as claimed in Claim 11, wherein as each instruction is decoded by the further decoding circuitry, the second instruction set processing unit is arranged to extend the series of bits by appending a number of bits of the predetermined value.
13. An apparatus as claimed in any of claims 10 to 12, wherein the second instruction set includes a skip instruction that is used to cause one or more subsequent instructions in the program to be skipped when a condition defined by the skip instruction is met.
14. An apparatus as claimed in Claim 13, wherein at least one instruction in the second instruction set is arranged, when executed, to cause one or more condition code flags to be set in dependence on a result generated by execution of that instruction, and the execution circuitry is arranged to determine whether the condition defined by the skip instruction is met with reference to a state of the one or more condition code flags.
15. An apparatus as claimed in Claim 13 or Claim 14, wherein: the condition is defined by a condition field within the encoding of the skip instruction; the second instruction set includes an end instruction that is used to identify when an end of the program has been reached; and the end instruction is encoded with identical opcode bits as are used to identify the skip instruction, but with the condition field set to a reserved value not used to define a condition.
16. An apparatus as claimed in any preceding claim, wherein: the second instruction set processing unit comprises a plurality of internal registers; when triggered to perform the sequence of operations, the second instruction set processing unit is arranged to store in a given internal register source data obtained from the location in memory identified by the memory location operand; and on completion of the sequence of operations the second instruction set processing unit is arranged, at least when result data differs from the source data, to write the result data to the location in memory identified by the memory location operand.
17. An apparatus as claimed in Claim 16, wherein the program-specifying instruction specifies at least one input value operand identifying at least one input value to be used by the second instruction set processing unit when performing the sequence of operations defined by the program, and the processing circuitry is arranged to provide the at least one input value to the second instruction set processing unit for storing in at least one internal register of the plurality of internal registers other than the given internal register.
18. An apparatus as claimed in Claim 17, wherein: the program-specifying instruction is configured to specify the at least one input value operand by identifying at least one register accessible to the processing circuitry that contains the at least one input value; the processing circuitry is arranged to read the at least one input value from the at least one register for provision to the second instruction set processing unit; and the second instruction set processing unit is arranged during performance of the sequence of operations to maintain at least one output value, and on completion of the sequence of operations is arranged to provide the at least one output value to the processing circuitry for storing in the at least one register that contained the at least one input value.
19. An apparatus as claimed in Claim 18, wherein during performance of the sequence of operations the second instruction set processing unit is arranged to cause the at least one output value to be maintained in the at least one internal register of the plurality of internal registers used to store the at least one input value.
20. A method of operating an apparatus, comprising: employing decoder circuitry to decode instructions of a first instruction set, wherein the decoder circuitry is responsive to instructions of the first instruction set to generate control signals; employing processing circuitry to be responsive to the control signals to cause operations defined by the instructions to be performed; and arranging the decoding circuitry to issue, in response to a program-specifying instruction of the first instruction set that specifies a memory location operand and a program operand, control signals to the processing circuitry to cause a second instruction set processing unit to be triggered to execute a program identified by the program operand in order to perform a sequence of operations defined by the program on data accessed at a location in memory identified by the memory location operand; wherein the program comprises one or more instructions of the second instruction set defining operations supported by the second instruction set processing unit.
21. A computer program comprising instructions which, when executed by a host data processing apparatus, control the host data processing apparatus to provide an instruction execution environment for executing target program code, the computer program comprising: instruction decoding program logic to decode instructions of a first instruction set, wherein the instruction decoding program logic is responsive to instructions of the first instruction set to generate control signals; and data processing program logic responsive to the control signals to cause operations defined by the instructions to be performed; wherein: the instruction decoding program logic is arranged to be responsive to a program-specifying instruction of the first instruction set that specifies a memory location operand and a program operand to issue control signals to the data processing program logic to cause second instruction set processing program logic to be triggered to execute a program identified by the program operand in order to perform a sequence of operations defined by the program on data accessed at a location in memory identified by the memory location operand, the program comprising one or more instructions of the second instruction set defining operations supported by the second instruction set processing program logic.
22. A computer-readable medium to store computer-readable code for fabrication of the apparatus of any of claims 1 to 19.