This is a CPU tool for benchmarking the peak performance of floating-point and AI ISAs.
It automatically detects the SIMD and DSA ISAs available on the local machine at compile time.
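As a rough illustration of what ISA detection can look like on Linux, the sketch below extracts the feature list from `/proc/cpuinfo`-style text. It is an assumption for illustration only: the helper name `cpu_features` is hypothetical, and the build scripts of this tool may detect ISAs in a different way.

```python
def cpu_features(cpuinfo_text):
    """Return the set of feature names found in /proc/cpuinfo-style text.

    x86-64 kernels list them under "flags", arm64 kernels under "Features".
    Hypothetical helper, not part of this tool.
    """
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            features.update(value.split())
    return features
```

On a Linux machine, passing the contents of `/proc/cpuinfo` to `cpu_features` and checking for names such as `avx`, `fma` or `asimd` gives a quick idea of which benchmarks are likely to be built.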
Arch | Linux | macOS | Windows |
---|---|---|---|
arm64 | yes | no | no |
e2k | yes | no | no |
loongarch64 | yes | no | no |
riscv64 | yes | no | no |
x86-64 | yes | no | no |
Arch | ISA | Feature | Data Type | Description |
---|---|---|---|---|
SIMD | SSE | Vector | fp32 | Before Sandy Bridge |
SIMD | SSE2 | Vector | fp64 | Before Sandy Bridge |
SIMD | AVX | Vector | fp32/fp64 | From Sandy Bridge |
SIMD | FMA | Vector | fp32/fp64 | From Haswell/Zen |
SIMD | AVX512f | Vector | fp32/fp64 | From Skylake X/Zen4 |
SIMD | AVX512_VNNI | Vector | int8/int16 | From IceLake |
SIMD | AVX_VNNI | Vector | int8/int16 | From Alder Lake |
SIMD | AVX512_FP16 | Vector | fp16 | From Intel Sapphire Rapids |
SIMD | AVX512_BF16 | Vector | bf16 | From AMD Zen4 |
SIMD | AVX_VNNI_INT8 | Vector | int8 | Unknown |
DSA | AMX_INT8 | Matrix | int8 | From Intel Sapphire Rapids |
DSA | AMX_BF16 | Matrix | bf16 | From Intel Sapphire Rapids |
Arch | ISA | Feature | Data Type | Description |
---|---|---|---|---|
SIMD | asimd | Vector | fp32/fp64 | From Cortex-A57/A53 |
SIMD | asimd_hp | Vector | fp16 | From Cortex-A75/A55 |
SIMD | asimd_dp | Vector | int8 | From Cortex-A75/A55 |
SIMD | bf16 | Matrix | bf16 | From Cortex-X2/A710/A510 |
SIMD | i8mm | Matrix | int8 | From Cortex-X2/A710/A510 |
Arch | ISA | Feature | Data Type | Description |
---|---|---|---|---|
SIMD | V | Vector | fp16/fp32/fp64 | From RISC-V "V" vector extension. Version 1.0 |
DSA | ime | Matrix | int8 | From SpacemiT-X60 |
NOTE: ime is a SpacemiT custom vendor extension.
Arch | ISA | Feature | Data Type | Description |
---|---|---|---|---|
SIMD | LASX | Vector | fp32/fp64 | From Loongson 3A5000 |
SIMD | LSX | Vector | fp32/fp64 | From Loongson 3A5000 |
Scalar | FP | Scalar | fp32/fp64 | From Loongson 3A5000 |
Arch | ISA | Feature | Vector Width | Data Type | Description |
---|---|---|---|---|---|
SIMD | v6 | Vector | 128 | fp32/fp64 | FMA |
SIMD | v5 | Vector | 128 | fp32/fp64 | Combined operations |
Scalar | v1-v4 | Scalar | | fp64 | Combined operations |
SIMD | v1-v4 | Vector | 64 | fp32 | Combined operations |
E2K supports combined instructions that perform two independent operations in a single instruction. A combined multiply-add resembles FMA, but it rounds the intermediate result as well, because the two operations are independent.
fmul_addd src1, src2, src3, dst
Multiply double-precision (64-bit) floating-point values from src1 and src2, and add the intermediate result to the value from src3. Store the result in dst.
dst[63:0] := src3[63:0] + src1[63:0] * src2[63:0]
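The extra rounding step is what separates a combined multiply-add from a true FMA. The sketch below models both in Python, assuming the combined operation rounds the product to double precision before the addition, as the description above suggests. The function names are hypothetical, and the fused variant is emulated with exact rational arithmetic rather than a hardware instruction.

```python
from fractions import Fraction


def combined_mul_add(a, b, c):
    """Two independent operations: round the product, then round the sum."""
    product = a * b        # first rounding to double precision
    return product + c     # second rounding


def fused_mul_add(a, b, c):
    """Reference FMA: evaluate a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)    # single rounding to the nearest double


# Inputs chosen so that the exact product 1 - 2**-54 is not representable.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

print(combined_mul_add(a, b, c))  # 0.0, the product was rounded up to 1.0
print(fused_mul_add(a, b, c))     # -2**-54, the low bits survive
```

For most benchmark inputs the two results agree; the difference only shows up when the product needs more than 53 significant bits.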
Architecture | Latency | Throughput (CPI) | ALC |
---|---|---|---|
elbrus-v4 | 8 | 0.16 | 012345 |
elbrus-v1 | 8 | 0.25 | 01-34- |
- ALC (Arithmetic Logic Complex/Channel) is an execution port for RISC-like instructions
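The throughput column can be turned into a per-core peak estimate. A CPI of 0.25 means four instructions per cycle, which matches the four ALCs listed for elbrus-v1; 0.16 is presumably a truncated 1/6, matching the six ALCs of elbrus-v4. The sketch below assumes each fmul_addd counts as two floating-point operations (one multiply, one add). The helper names are hypothetical, and the clock frequency and core count are inputs that must be supplied by the reader.

```python
def flops_per_cycle(cpi, flops_per_instruction=2, lanes=1):
    """Floating-point operations per cycle for one core.

    cpi: cycles per instruction (reciprocal throughput) from the table.
    flops_per_instruction: 2 for a multiply-add, assumed here.
    lanes: number of elements per instruction, 1 for a scalar operation.
    """
    return flops_per_instruction * lanes / cpi


def peak_gflops(cpi, freq_ghz, cores=1, flops_per_instruction=2, lanes=1):
    """Theoretical peak in GFLOPS for the given clock and core count."""
    return flops_per_cycle(cpi, flops_per_instruction, lanes) * freq_ghz * cores
```

With the elbrus-v1 row, `flops_per_cycle(0.25)` gives 8 fp64 operations per cycle per core; with six fully used ALCs, `flops_per_cycle(1 / 6)` gives 12.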
build x64 version:
./build_x64.sh
build arm64 version:
./build_arm64.sh
build riscv64 version:
./build_riscv64.sh
build loongarch64 version:
./build_loongarch64.sh
build e2k version:
./build_e2k.sh
clean:
./clean.sh
./cpufp --thread_pool=[xxx] --idle_time=yyy
--thread_pool: [xxx] is the list of CPU threads to benchmark, selected by setting thread affinities. Refer to the output of the lstopo command to find the thread ids. For example, [0,3,5-8,13-15].
--idle_time: the idle interval, in seconds, between two consecutive benchmarks. The default is 0.
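To make the list syntax concrete, here is a minimal sketch that expands a `--thread_pool` value into individual CPU ids. It assumes ranges such as 5-8 are inclusive on both ends; `parse_thread_pool` is a hypothetical helper and not the parser used by cpufp.

```python
def parse_thread_pool(spec):
    """Expand a list such as "[0,3,5-8,13-15]" into individual CPU ids."""
    cpus = []
    for item in spec.strip().strip("[]").split(","):
        item = item.strip()
        if not item:
            continue
        if "-" in item:
            # Assumed semantics: "first-last" covers both endpoints.
            first, last = (int(part) for part in item.split("-", 1))
            if first > last:
                raise ValueError("descending range: " + item)
            cpus.extend(range(first, last + 1))
        else:
            cpus.append(int(item))
    return cpus


print(parse_thread_pool("[0,3,5-8,13-15]"))
# [0, 3, 5, 6, 7, 8, 13, 14, 15]
```

Under this reading, the example from the option description selects nine hardware threads.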
x86-64 cpufp benchmark results
riscv64 cpufp benchmark results
loongarch64 cpufp benchmark results
Add Armv9 (SVE, SVE2 and SME) support.