This patch extends the non-tensor TMA Bulk Copy Op
(from shared_cta to global) with an optional
byte mask operand. The mask selects which bytes of
each 16-byte chunk of source data are copied to
the destination.
* lit tests are added to verify the lowering to
the intrinsics.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
   let summary = "Async bulk copy from Shared CTA memory to Global memory";
   let description = [{
     Initiates an asynchronous copy operation from Shared CTA memory to
-    global memory.
+    global memory. The 32-bit operand `size` specifies the amount of
+    memory to be copied, in terms of number of bytes. `size` must be a
+    multiple of 16. The `l2CacheHint` operand is optional, and it is used
+    to specify cache eviction policy that may be used during the memory
+    access. The i-th bit in the 16-bit wide `byteMask` operand specifies
+    whether the i-th byte of each 16-byte wide chunk of source data is
+    copied to the destination. If the bit is set, the byte is copied.
 
-    The `l2CacheHint` operand is optional, and it is used to specify cache
-    eviction policy that may be used during the memory access.
-
     [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk)
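
The `byteMask` semantics described in the op documentation can be illustrated with a small host-side reference model. This is only a sketch and not part of the patch: the helper name `masked_bulk_copy` is hypothetical, bit 0 is assumed to be the least significant bit of the mask, and destination bytes whose mask bit is clear are assumed to be left unchanged.

```python
def masked_bulk_copy(src: bytes, dst: bytearray, size: int, byte_mask: int) -> None:
    """Host-side model of a byte-masked bulk copy (hypothetical helper).

    Assumptions (not confirmed by the patch): bit 0 of byte_mask is the
    least significant bit, and bytes whose bit is clear stay untouched
    in dst.
    """
    if size % 16 != 0:
        raise ValueError("size must be a multiple of 16")
    if not 0 <= byte_mask <= 0xFFFF:
        raise ValueError("byte_mask must fit in 16 bits")
    if size > len(src) or size > len(dst):
        raise ValueError("size exceeds a buffer length")

    for offset in range(size):
        i = offset % 16  # byte position inside its 16-byte chunk
        if (byte_mask >> i) & 1:
            dst[offset] = src[offset]


# Two 16-byte chunks; the mask 0x0005 selects bytes 0 and 2 of each chunk.
src = bytes(range(32))
dst = bytearray(b"\xff" * 32)
masked_bulk_copy(src, dst, 32, 0x0005)
```

In this model, a mask of `0xFFFF` copies every byte, matching the behaviour of the unmasked copy, while a mask of `0x0000` leaves the destination unmodified.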