Inductor logging + analysis of torch.profile #149697 · pytorch/pytorch
Pull request by exclamaforte

Status: Open. Wants to merge 1 commit into main.
Changes from all commits
704 changes: 704 additions & 0 deletions test/inductor/test_analysis.py

Large diffs are not rendered by default.

59 changes: 59 additions & 0 deletions test/profiler/test_profiler.py
@@ -27,6 +27,7 @@
import torch.optim
import torch.utils.data
from torch._C._profiler import _ExperimentalConfig, _ExtraFields_PyCall
from torch._inductor.ir import FixedLayout
from torch.autograd.profiler import KinetoStepTracker, profile as _profile
from torch.autograd.profiler_legacy import profile as _profile_legacy
from torch.profiler import (
@@ -2998,6 +2999,64 @@ def validate_json(prof):
assert "Overload Name" in key_averages.table()
validate_json(prof)

@unittest.skipIf(not torch.cuda.is_available(), "CUDA is required")
# this checks whether a Triton-only backend can be used for max autotune
@unittest.skipIf(
torch.cuda.is_available()
and not torch._inductor.utils.use_triton_template(
FixedLayout(torch.device("cuda"), torch.float16, [400, 800])
),
"Solo triton backend not possible",
)
Comment on lines +3006 to +3010 (Contributor):

Ugh, this is a bit indirect. Could we just do the same checks we do in test_max_autotune that check whether we can run it? See is_big_gpu. I don't like reaching into implementation details when it doesn't add any benefit.
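A minimal sketch of the reviewer's suggestion, assuming is_big_gpu is importable from torch._inductor.utils as in the max-autotune tests; the decorator name and skip message are illustrative, not part of the PR:

import unittest

import torch
from torch._inductor.utils import is_big_gpu

# Hypothetical replacement for the FixedLayout/use_triton_template skip condition:
# reuse the same capability check the max-autotune tests rely on.
requires_triton_autotune = unittest.skipIf(
    not (torch.cuda.is_available() and is_big_gpu(0)),
    "requires a GPU large enough for Triton max-autotune GEMM templates",
)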

def test_profiler_debug_autotuner(self):
"""
This test makes sure that profiling events will be present when the kernel is run using the DebugAutotuner.
"""
in1 = torch.randn((400, 600), device="cuda", dtype=torch.float16)
in2 = torch.randn((600, 800), device="cuda", dtype=torch.float16)

Comment on lines +3015 to +3017 (Contributor):

nit: make the tensors aligned so we won't do padding
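For illustration, a hedged sketch of the nit: choosing GEMM dimensions that are all multiples of 16 so the fp16 inputs should not need shape padding (the exact sizes here are an assumption, not part of the PR):

# Hypothetical aligned shapes; every GEMM dimension is a multiple of 16.
in1 = torch.randn((400, 608), device="cuda", dtype=torch.float16)
in2 = torch.randn((608, 800), device="cuda", dtype=torch.float16)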

def mm():
return torch.mm(in1, in2)

pb_mm = torch.compile(
mm,
options={
"benchmark_kernel": True,
"max_autotune": True,
"max_autotune_gemm_backends": "TRITON",
"profile_bandwidth": True,
},
)
comp_mm = torch.compile(
mm,
options={
"benchmark_kernel": True,
"max_autotune": True,
"max_autotune_gemm_backends": "TRITON",
},
)

with profile() as prof1:
pb_mm()
with profile() as prof2:
comp_mm()

def names(prof):
return {
ev.name
for ev in prof.events()
if "mm" in ev.name or "triton" in ev.name
}

trace1 = "/tmp/trace1_pb.json"
trace2 = "/tmp/trace2_nopb.json"
prof1.export_chrome_trace(trace1)
prof2.export_chrome_trace(trace2)
Comment on lines +3053 to +3054 (Contributor):

What does this test?


n1 = names(prof1)
n2 = names(prof2)
self.assertEqual(n1, n2)


if __name__ == "__main__":
run_tests()
1 change: 1 addition & 0 deletions test/test_flop_counter.py
@@ -854,5 +854,6 @@ def test_scaled_mm(self):

self.assertExpectedInline(get_total_flops(mode), """860160""")


if __name__ == "__main__":
run_tests()
2 changes: 2 additions & 0 deletions torch/_inductor/analysis/README.md
@@ -0,0 +1,2 @@
# `torch._inductor.analysis`
Contains scripts for inductor performance analysis.
Empty file.
150 changes: 150 additions & 0 deletions torch/_inductor/analysis/device_info.py
@@ -0,0 +1,150 @@
from dataclasses import dataclass
from logging import info
from typing import Optional

import torch


@dataclass(frozen=True)
class DeviceInfo:
"""
Theoretical numbers from the datasheet. If two numbers are given (Tensor/Matrix Core vs. not),
the higher one is reported. Sparsity is not considered.

Bandwidth numbers are tricky, because there are platform differences that may not show up in the profiler trace.
For example, H100 SXM and H100 NVL have different DRAM bandwidths but can report the same device name.

tops: dict[torch.dtype, float]
dram_bw_gbs: float
dram_gb: float


# Indexing is based on `torch.cuda.get_device_name()`
# TODO investigate profiler support for tf32 and allow device to report correct number when it's turned on.
_device_mapping: dict[str, DeviceInfo] = {
Comment on lines +24 to +26 (Contributor):

Can we file an issue for this as a follow-up? It's not great that we are not doing this programmatically.

# Source: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
"NVIDIA H100": DeviceInfo(
Comment (Contributor):

How do we distinguish between H100 SXM and H100 NVL?

Reply (Contributor Author):

This is based on torch.cuda.get_device_name(), which is stored by the profiler and is available at runtime too. I'm not sure how to distinguish them, even at runtime.
Some ideas:

>>> torch.cuda.get_device_properties()
_CudaDeviceProperties(name='NVIDIA H100', major=9, minor=0, total_memory=97272MB, multi_processor_count=132, uuid=6efc17fa-5b7e-0452-613b-df241e45f2b8, L2_cache_size=60MB)
>>> torch.cuda.mem_get_info()
(99949740032, 101997215744)

tops={
torch.float64: 9.7,
torch.float32: 19.5,
torch.bfloat16: 1979.0,
torch.float16: 1979.0,
torch.float8_e8m0fnu: 3958.0,
torch.float8_e4m3fnuz: 3958.0,
Comment on lines +28 to +36 (Contributor):

I know the fbcode servers are clock-rate limited, so the numbers will be off for those. I think this is actually somewhat important for getting accurate numbers.

cc @bertmaher, who did a similar analysis here: https://fb.workplace.com/groups/420659799592399/posts/761265522198490/

How would you adjust for clock rate? Is something simple like current_clock_rate/default sufficient? I don't have a good sense of this.

Reply (Contributor Author):

Interested in Bert's opinion too. I would think that current_clock_rate/default would be fine, considering that most of the FLOPS calculations are just clock rate * core count * FLOPS per core.

torch.float8_e5m2: 3958.0,
torch.float8_e5m2fnuz: 3958.0,
torch.int8: 3958.0,
},
dram_bw_gbs=3350,
dram_gb=80,
),
# Source: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
"NVIDIA A100": DeviceInfo(
Comment (Contributor):

Similarly: [image attachment not rendered]

Reply (Contributor Author):

Yeah, I saw that. I'm not sure how we solve this in general for bandwidth (FLOPS seems fine). For example, on an 8x machine the interconnect bandwidth could be more important than DRAM bandwidth.

tops={
torch.float64: 19.5,
torch.float32: 19.5,
torch.bfloat16: 312.5,
torch.float16: 312.5,
# Not in datasheet: float8
torch.int8: 624.0,
},
dram_bw_gbs=2039.0,
dram_gb=80.0,
),
# Source: https://resources.nvidia.com/en-us-gpu-resources/l4-tensor-datasheet
"NVIDIA L4": DeviceInfo(
tops={
# This is a guess, not in datasheet
torch.float64: 15.1,
torch.float32: 30.3,
torch.bfloat16: 242.0,
torch.float16: 242.0,
torch.float8_e8m0fnu: 485.0,
torch.float8_e4m3fnuz: 485.0,
torch.float8_e5m2: 485.0,
torch.float8_e5m2fnuz: 485.0,
torch.int8: 485.0,
},
dram_bw_gbs=300,  # L4 datasheet lists 300 GB/s
dram_gb=24,
),
# Source: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300a-data-sheet.pdf
"AMD MI300A": DeviceInfo(
tops={
torch.float64: 122.6,
torch.float32: 122.6,
# torch.tf32: 490.3,
torch.bfloat16: 980.6,
torch.float16: 980.6,
torch.float8_e8m0fnu: 1961.2,
torch.float8_e4m3fnuz: 1961.2,
torch.float8_e5m2: 1961.2,
torch.float8_e5m2fnuz: 1961.2,
torch.int8: 1961.2,
},
dram_bw_gbs=5300.0,
dram_gb=128.0,
),
# Source: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf
"AMD MI300X": DeviceInfo(
tops={
torch.float64: 163.4,
torch.float32: 163.4,
torch.bfloat16: 1307.4,
torch.float16: 1307.4,
torch.float8_e8m0fnu: 2614.9,
torch.float8_e4m3fnuz: 2614.9,
torch.float8_e5m2: 2614.9,
torch.float8_e5m2fnuz: 2614.9,
torch.int8: 2614.9,
},
dram_bw_gbs=5300.0,
dram_gb=192.0,
),
}
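Following up on the clock-rate discussion above, a minimal sketch of a current_clock_rate / default style adjustment. It uses pynvml and the SM clock, both of which are assumptions rather than anything this PR does:

import pynvml


def clock_adjusted_tops(datasheet_tops_value: float, device_index: int = 0) -> float:
    """Scale a datasheet TOPS number by current_sm_clock / max_sm_clock (sketch)."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        current_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        max_mhz = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)
        return datasheet_tops_value * (current_mhz / max_mhz)
    finally:
        pynvml.nvmlShutdown()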


def lookup_device_info(name: str) -> Optional[DeviceInfo]:
"""
Problem: when diffing profiles between AMD and NVIDIA, we don't have access to the device information
of the other vendor's device. Also, since the analysis is static, we should be able to run it on a device unrelated
to the recorded one. Therefore, _device_mapping statically contains the information for many devices.
If one is missing, please run DeviceInfo.get_device_info() and add it to _device_mapping.
name (str): name of the device to look up. Should map onto torch.cuda.get_device_name().
"""
if name not in _device_mapping:
return None
return _device_mapping[name]


def datasheet_tops(dtype: torch.dtype) -> Optional[float]:
"""
Get the theoretical TOPS of the current device for a given dtype. Returns None if the device
or the dtype is not in the datasheet mapping above.
"""
name: Optional[str] = torch.cuda.get_device_name()
if name is None:
info("No device found, returning None")
return None
device_info = lookup_device_info(name)
if device_info is None:
log_str = f"Device {name} not in datasheet, returning None"
info(log_str)
return None
if dtype not in device_info.tops:
log_str = (
f"Device {name} does not have a datasheet entry for {dtype}, returning None"
)
info(log_str)
return None
return device_info.tops[dtype]
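For context, a short usage sketch of the helpers in this file, run on a CUDA machine; the FLOP count and kernel runtime below are made-up illustrative values:

import torch

from torch._inductor.analysis.device_info import datasheet_tops, lookup_device_info

info = lookup_device_info(torch.cuda.get_device_name())
peak_tops = datasheet_tops(torch.float16)  # theoretical fp16 TOPS, or None
if info is not None and peak_tops is not None:
    flops = 2 * 400 * 600 * 800  # one (400, 600) @ (600, 800) matmul
    runtime_s = 25e-6            # hypothetical measured kernel time
    achieved_tops = flops / runtime_s / 1e12
    print(f"fp16 utilization: {achieved_tops / peak_tops:.1%}")
    print(f"DRAM bandwidth: {info.dram_bw_gbs} GB/s, DRAM size: {info.dram_gb} GB")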