Inductor logging + analysis of torch.profile #149697
@@ -27,6 +27,7 @@
import torch.optim
import torch.utils.data
from torch._C._profiler import _ExperimentalConfig, _ExtraFields_PyCall
from torch._inductor.ir import FixedLayout
from torch.autograd.profiler import KinetoStepTracker, profile as _profile
from torch.autograd.profiler_legacy import profile as _profile_legacy
from torch.profiler import (

@@ -2998,6 +2999,64 @@ def validate_json(prof):
        assert "Overload Name" in key_averages.table()
        validate_json(prof)

    @unittest.skipIf(not torch.cuda.is_available(), "CUDA is required")
    # this checks whether max autotune can be run with the Triton backend alone
    @unittest.skipIf(
        torch.cuda.is_available()
        and not torch._inductor.utils.use_triton_template(
            FixedLayout(torch.device("cuda"), torch.float16, [400, 800])
        ),
        "Solo triton backend not possible",
    )
    def test_profiler_debug_autotuner(self):
        """
        This test makes sure that profiling events are present when the kernel is run using the DebugAutotuner.
        """
        in1 = torch.randn((400, 600), device="cuda", dtype=torch.float16)
        in2 = torch.randn((600, 800), device="cuda", dtype=torch.float16)
Review comment on lines +3015 to +3017: nit: make the tensors aligned so we won't do padding. (A sketch of aligned shapes follows this test file.)
        def mm():
            return torch.mm(in1, in2)

        pb_mm = torch.compile(
            mm,
            options={
                "benchmark_kernel": True,
                "max_autotune": True,
                "max_autotune_gemm_backends": "TRITON",
                "profile_bandwidth": True,
            },
        )
        comp_mm = torch.compile(
            mm,
            options={
                "benchmark_kernel": True,
                "max_autotune": True,
                "max_autotune_gemm_backends": "TRITON",
            },
        )

        with profile() as prof1:
            pb_mm()
        with profile() as prof2:
            comp_mm()

        def names(prof):
            return {
                ev.name
                for ev in prof.events()
                if "mm" in ev.name or "triton" in ev.name
            }

        trace1 = "/tmp/trace1_pb.json"
        trace2 = "/tmp/trace2_nopb.json"
        prof1.export_chrome_trace(trace1)
        prof2.export_chrome_trace(trace2)
Review comment on lines +3053 to +3054: what does this test?
        n1 = names(prof1)
        n2 = names(prof2)
        self.assertEqual(n1, n2)


if __name__ == "__main__":
    run_tests()
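The reviewer's alignment nit could be addressed by picking shapes whose inner dimension is a multiple of 16, so the fp16 operands need no padding. A minimal sketch of that tweak (illustrative only, not part of the diff):

```python
# Hypothetical aligned shapes: 608 = 38 * 16, so neither operand of the fp16
# matmul needs padding; 600 in the original test is not a multiple of 16.
in1 = torch.randn((400, 608), device="cuda", dtype=torch.float16)
in2 = torch.randn((608, 800), device="cuda", dtype=torch.float16)
```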
@@ -0,0 +1,2 @@
# `torch._inductor.analysis`
Contains scripts for Inductor performance analysis.
@@ -0,0 +1,150 @@
from dataclasses import dataclass
from logging import info
from typing import Optional

import torch


@dataclass(frozen=True)
class DeviceInfo:
    """
    Theoretical numbers from the datasheet. If two numbers are given (Tensor/Matrix Core vs. not),
    the higher number is reported. Sparsity is not considered.

    Bandwidth numbers are tricky, because there are platform differences that may not show up in the
    profiler trace. For example, ...
    """

    tops: dict[torch.dtype, float]
    dram_bw_gbs: float
    dram_gb: float


# Indexing is based on `torch.cuda.get_device_name()`
# TODO investigate profiler support for tf32 and allow device to report correct number when it's turned on.
_device_mapping: dict[str, DeviceInfo] = {
Review comment on lines +24 to +26: Can we file an issue for this as a follow-up? It's not great that we are not doing this programmatically.
    # Source: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
    "NVIDIA H100": DeviceInfo(
Review comment: how do we distinguish between ...
Reply: This is based on ...
        tops={
            torch.float64: 9.7,
            torch.float32: 19.5,
            torch.bfloat16: 1979.0,
            torch.float16: 1979.0,
            torch.float8_e4m3fn: 3958.0,
            torch.float8_e4m3fnuz: 3958.0,
Review comment on lines +28 to +36: I know the fbcode servers are clock-rate limited, so the numbers will be off for those. I think this is actually somewhat important for getting accurate numbers. cc @bertmaher, who did similar analysis here: https://fb.workplace.com/groups/420659799592399/posts/761265522198490/ How would you adjust for clock rate? Is something simple like current_clock_rate/default sufficient? I don't have a good sense of this. (A rough sketch of such an adjustment follows the device table below.)
Reply: interested in Bert's opinion too, I would think that ...
            torch.float8_e5m2: 3958.0,
            torch.float8_e5m2fnuz: 3958.0,
            torch.float8_e8m0fnu: 3958.0,
            torch.int8: 3958.0,
        },
        dram_bw_gbs=3350,
        dram_gb=80,
    ),
    # Source: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
    "NVIDIA A100": DeviceInfo(
Reply: yeah, I saw that. I'm not sure how we solve this in general for bandwidth (flops seems fine). On an 8x machine, the interconnect bandwidth could matter more than DRAM bandwidth.
        tops={
            torch.float64: 19.5,
            torch.float32: 19.5,
            torch.bfloat16: 312.5,
            torch.float16: 312.5,
            # Not in datasheet: float8
            torch.int8: 624.0,
        },
        dram_bw_gbs=2039.0,
        dram_gb=80.0,
    ),
    # Source: https://resources.nvidia.com/en-us-gpu-resources/l4-tensor-datasheet
    "NVIDIA L4": DeviceInfo(
        tops={
            # This is a guess, not in datasheet
            torch.float64: 15.1,
            torch.float32: 30.3,
            torch.bfloat16: 242.0,
            torch.float16: 242.0,
            torch.float8_e4m3fn: 485.0,
            torch.float8_e4m3fnuz: 485.0,
            torch.float8_e5m2: 485.0,
            torch.float8_e5m2fnuz: 485.0,
            torch.float8_e8m0fnu: 485.0,
            torch.int8: 485.0,
        },
        dram_bw_gbs=3350,
        dram_gb=24,
    ),
    # Source: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300a-data-sheet.pdf
    "AMD MI300A": DeviceInfo(
        tops={
            torch.float64: 122.6,
            torch.float32: 122.6,
            # torch.tf32: 490.3,
            torch.bfloat16: 980.6,
            torch.float16: 980.6,
            torch.float8_e4m3fn: 1961.2,
            torch.float8_e4m3fnuz: 1961.2,
            torch.float8_e5m2: 1961.2,
            torch.float8_e5m2fnuz: 1961.2,
            torch.float8_e8m0fnu: 1961.2,
            torch.int8: 1961.2,
        },
        dram_bw_gbs=5300.0,
        dram_gb=128.0,
    ),
    # Source: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf
    "AMD MI300X": DeviceInfo(
        tops={
            torch.float64: 163.4,
            torch.float32: 163.4,
            torch.bfloat16: 1307.4,
            torch.float16: 1307.4,
            torch.float8_e4m3fn: 2614.9,
            torch.float8_e4m3fnuz: 2614.9,
            torch.float8_e5m2: 2614.9,
            torch.float8_e5m2fnuz: 2614.9,
            torch.float8_e8m0fnu: 2614.9,
            torch.int8: 2614.9,
        },
        dram_bw_gbs=5300.0,
        dram_gb=192.0,
    ),
}
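As a rough illustration of the clock-rate adjustment discussed in the review thread above, the datasheet TOPS could be scaled by the ratio of the current SM clock to the maximum SM clock. This is only a sketch of that idea using pynvml; the `clock_adjusted_tops` helper is hypothetical and not part of the PR:

```python
# Hypothetical helper: scale a datasheet TOPS value by current_sm_clock / max_sm_clock,
# along the lines of the "current_clock_rate/default" idea from the review discussion.
# Requires the optional pynvml package.
import pynvml


def clock_adjusted_tops(datasheet_value: float, device_index: int = 0) -> float:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        current = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)  # MHz
        maximum = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)  # MHz
        return datasheet_value * (current / maximum)
    finally:
        pynvml.nvmlShutdown()
```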


def lookup_device_info(name: str) -> Optional[DeviceInfo]:
    """
    Problem: when diffing profiles between AMD and NVIDIA, we don't have access to the device
    information of the other vendor. Also, since the analysis is static, we should be able to run it
    on a device unrelated to the recorded one. Therefore, _device_mapping statically contains the
    information for many devices. If one is missing, please run DeviceInfo.get_device_info() and add
    it to _device_mapping.

    name (str): name of the device to look up. Should map onto torch.cuda.get_device_name().
    """
    if name not in _device_mapping:
        return None
    return _device_mapping[name]
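For context, here is a minimal sketch of how the lookup could feed a roofline-style estimate for a single fp16 matmul. The shapes, the device name, and the surrounding arithmetic are illustrative assumptions, not part of the PR:

```python
# Roofline-style estimate for a (400, 600) x (600, 800) fp16 matmul on an H100,
# using the datasheet numbers stored in _device_mapping via lookup_device_info.
dev = lookup_device_info("NVIDIA H100")
if dev is not None:
    m, k, n = 400, 600, 800
    flops = 2 * m * k * n                                 # multiply-accumulate count
    bytes_moved = 2 * (m * k + k * n + m * n)             # fp16 = 2 bytes per element
    compute_s = flops / (dev.tops[torch.float16] * 1e12)  # TOPS -> ops per second
    memory_s = bytes_moved / (dev.dram_bw_gbs * 1e9)      # GB/s -> bytes per second
    print(f"roofline lower bound: {max(compute_s, memory_s):.3e} s")
```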


def datasheet_tops(dtype: torch.dtype) -> Optional[float]:
    """
    Get the theoretical TFLOPS of the current device for a given dtype. Returns None if the device
    or the dtype is not in the datasheet list above.
    """
    name: Optional[str] = torch.cuda.get_device_name()
    if name is None:
        info("No device found, returning None")
        return None
    device_info = lookup_device_info(name)
    if device_info is None:
        log_str = f"Device {name} not in datasheet, returning None"
        info(log_str)
        return None
    if dtype not in device_info.tops:
        log_str = (
            f"Device {name} does not have a datasheet entry for {dtype}, returning None"
        )
        info(log_str)
        return None
    return device_info.tops[dtype]
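A minimal usage sketch for `datasheet_tops`, assuming a CUDA device is visible and this module is importable (illustrative only):

```python
# Query the theoretical fp16 TOPS of whatever GPU torch sees first; None is returned
# if the device or the dtype is missing from the table above.
if torch.cuda.is_available():
    peak = datasheet_tops(torch.float16)
    if peak is not None:
        print(f"{torch.cuda.get_device_name()}: {peak} TFLOPS (fp16, dense)")
```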
Review comment: Ugh, this is a bit indirect. Could we just do the same check we do in test_max_autotune to decide whether we can run it? See `is_big_gpu`. I don't like reaching into implementation details when it doesn't add any benefit. (A sketch of that alternative skip condition follows below.)
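A hedged sketch of the reviewer's suggestion: gate the test on the same `is_big_gpu` check used by test_max_autotune instead of constructing a FixedLayout. The decorator name below is hypothetical and the `is_big_gpu` signature may differ across PyTorch versions, so treat this as illustrative only:

```python
# Possible alternative skip condition, per the review suggestion.
import unittest

import torch
from torch._inductor.utils import is_big_gpu

skip_unless_triton_autotune = unittest.skipIf(
    not torch.cuda.is_available() or not is_big_gpu(0),
    "requires CUDA and a GPU large enough for Triton-only max-autotune",
)
```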