Inductor logging + analysis of torch.profile #149697
Conversation
Why is matmul/convolution missing in the diff table?
Looking good, a few comments.
General questions:
- What is this enabled under? I see config.profile_bandwidth and config.benchmark_kernel?
- I think not being on by default makes sense. But just to say it - if this is on by default, we need to be careful about interfering with caching by annotating the configs. But anyways, not a real issue.
- Could be nice to express flops/memory bandwidth as expressions of symints, because today they could be misleading with different shapes.
- Ultimately we are going to want to look at memory bandwidth and flops not just as raw numbers but as a % of hardware limits, so it would be great to enable that as part of perfetto if possible.
For my own understanding - how does this get into the perfetto trace? And is there any way to augment it further outside of perfetto (like calculating memory bw based on symint inputs)? cc @davidberard98, who is knowledgeable about the profiler.
torch/_inductor/scheduler.py (Outdated)
@@ -736,6 +737,48 @@ def get_buf_bytes(
return buf_byte_accesses

@cache_on_self
def estimate_flops(self) -> int | None:
How do we want to handle symints? We would need to augment the caching autotuner with a sympy expression based on inputs.
FlopCounter should work with sympy. See the colab here: https://colab.research.google.com/drive/1zjAisRrc8R6uixKsrs1DRm3lwz5MWN68#scrollTo=zbl9to6G02vQ
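For reference, a minimal sketch of counting flops with FlopCounterMode (concrete shapes on meta tensors; the symint variant would substitute fake tensors with symbolic sizes, as in the colab above):

import torch
from torch.utils.flop_counter import FlopCounterMode

# Meta tensors carry only shape/dtype, so no real data is allocated;
# FlopCounterMode intercepts the op at dispatch time and tallies flops.
a = torch.empty(400, 600, device="meta", dtype=torch.float16)
b = torch.empty(600, 800, device="meta", dtype=torch.float16)
with FlopCounterMode(display=False) as counter:
    torch.mm(a, b)
print(counter.get_total_flops())  # 2 * 400 * 600 * 800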
@eellison So symints ended up breaking this feature when used with max autotune, because it would generate an inductor_meta like:
inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': '969E2A31A8F620FC2E1E37975F57012FF2CF1D2549D5BE17FB534BA8B4F5EF84', 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': None, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': True, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False, 'grid_type': 'FixedGrid', 'fixed_grid': ['_grid_0', '_grid_1', '_grid_2'], 'extra_launcher_args': ['_grid_0', '_grid_1', '_grid_2'], 'kernel_num_gb': 0.042842176, 'kernel_flop': 1536*s27*s38},
and would throw an error:
File "/tmp/tmpbx3d5iaf/4y/c4yldhzvilvh5h35ih35aud3c42bbzeuuvxspmufnlojgth7yxgy.py", line 20, in <module>
torch._inductor.exc.InductorError: NameError: name 's27' is not defined
So I added a line like:
resolved_flops = V.graph.sizevars.size_hints((flops,))[0]
Good point, I'll take a look.
On it.
"kernel_bandwidth": self.inductor_meta.get("kernel_num_gb", None),
"kernel_flops": self.inductor_meta.get("kernel_flops", None), These two lines add it to the |
The symints are not needed for this PR! I think other things are more important, but it could be cool in the future.
Overall seems fine to me. Just wondering if we think we will want this on by default? Many users have been complaining about the size of the traces.
Test failures?
Excited for this! Left some comments.
cc @sraikund16 can you review as well? At least the profiler-relevant parts.
Also, today a user has to manually run the post-processing step. It would be great if there were an API to register a post-processing fn that gets called before the profile is saved, so this gets added automatically.
# TODO investigate profiler support for tf32 and allow device to report correct number when it's turned on.
_device_mapping: dict[str, DeviceInfo] = {
    # Source: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
    "NVIDIA H100": DeviceInfo(
How do we distinguish between H100 SXM and H100 NVL?
This is based on torch.cuda.get_device_name(), which is stored by the profiler and available at runtime too. I'm not sure how to distinguish them, even at runtime.
Some ideas:
>>> torch.cuda.get_device_properties()
_CudaDeviceProperties(name='NVIDIA H100', major=9, minor=0, total_memory=97272MB, multi_processor_count=132, uuid=6efc17fa-5b7e-0452-613b-df241e45f2b8, L2_cache_size=60MB)
>>> torch.cuda.mem_get_info()
(99949740032, 101997215744)
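One possible heuristic (purely an assumption, not something this PR does): fall back on total device memory to tell the variants apart, since the reported name string is the same.

import torch

def guess_h100_variant(device: int = 0) -> str:
    # H100 NVL ships with roughly 94 GB of HBM3; SXM/PCIe parts have roughly 80 GB.
    total_gb = torch.cuda.get_device_properties(device).total_memory / 1e9
    return "NVIDIA H100 NVL" if total_gb > 90 else "NVIDIA H100"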
        dram_gb=80,
    ),
    # Source: https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
    "NVIDIA A100": DeviceInfo(
Yeah, I saw that. I'm not sure how we solve it in general for bw (flops seems fine). Like on an 8x machine, the interconnect bw could be more important than dram bw.
"NVIDIA H100": DeviceInfo( | ||
tops={ | ||
torch.float64: 9.7, | ||
torch.float32: 19.5, | ||
torch.bfloat16: 1979.0, | ||
torch.float16: 1979.0, | ||
torch.float8_e8m0fnu: 3958.0, | ||
torch.float8_e8m0fnu: 3958.0, | ||
torch.float8_e4m3fnuz: 3958.0, |
I know the fbcode servers are clock-rate limited, so the numbers will be off for those. I think this is actually somewhat important for getting accurate numbers.
cc @bertmaher - who did a similar analysis here - https://fb.workplace.com/groups/420659799592399/posts/761265522198490/
How would you adjust for clock rate? Is something simple like current_clock_rate/default sufficient? I don't have a good sense of this.
Interested in Bert's opinion too. I would think that current_clock_rate/default would be fine, considering that most of the flops calculations are just clock rate * core count * flops per core.
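A rough sketch of that adjustment (an assumption about how it could be wired up, via NVML; nothing in this PR does this):

import pynvml

def clock_adjusted_tflops(peak_tflops: float, device_index: int = 0) -> float:
    # Scale the datasheet peak TFLOPS by current SM clock / max SM clock.
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        current = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        maximum = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)
        return peak_tflops * current / maximum
    finally:
        pynvml.nvmlShutdown()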
def conv_adapter(
    shapes: tuple[Any, ...], concrete: tuple[Any, ...]
) -> tuple[tuple[Any], dict[Any, Any]]:
    tmp = list(shapes)
    if len(tmp) == 4:
        transposed = False
    ...
    transposed = bool(tmp[6])
    tmp[6] = transposed
    ...
    kwargs = {}
Do you have any more details on what the format is for the json trace? Ideally we would be able to do this automatically with the schema.
The json trace args are just the input arg dimensions in a list (convenient). For convolution, if transposed is set, then the flops counter also needs "out_val", which at runtime contains the output tensor. I don't think it's necessary; basically we could move the conv_out_dims function inside the flops counter, but I didn't want to mess with the flops counter in this PR.
Other than that, the len(tmp) == 4 comes from the fact that sometimes the input args only had 4 dims, not sure why. Maybe convolution vs _convolution...
In summary, I think we can do it from just the schema, but it requires the flops counter to be smarter. I expected other ops to have more conversions, but it's really just convolution.
Couldn't we create FakeTensors with the input shapes, then run FlopCounter with them?
So I just implemented this, but I think the code ends up looking worse, because when I call the size function directly it knows to ignore all of the other inputs like SymInt groups and bool deterministic, whereas if I run the op they get typechecked in C++.
I still think the current impl is wrong: we should just instantiate the tensors, run FlopCounter mode, and run the op. You don't need to run the size function directly.
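A sketch of that suggestion (assuming an op whose positional inputs are all tensors, e.g. aten.mm; convolution would also need its stride/padding/etc. arguments passed through):

import torch
from torch.utils.flop_counter import FlopCounterMode

def flops_from_shapes(op, shapes) -> int:
    # Instantiate meta tensors with the recorded shapes, then run the op
    # under FlopCounterMode; no size function is called directly.
    inputs = [torch.empty(shape, device="meta") for shape in shapes]
    with FlopCounterMode(display=False) as counter:
        op(*inputs)
    return counter.get_total_flops()

# e.g. flops_from_shapes(torch.ops.aten.mm, [[400, 600], [600, 800]])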
This refactors `estimate_flops` and `get_estimated_runtime` on scheduler nodes:
1. New function on BaseSchedulerNode: `estimate_flops`. Works with all types of ir nodes now, not just `ExternalKernels`.
2. Extends `get_estimated_runtime` to work with non-`ExternalKernels`.
Prelude to: #149697
Testing: New unit tests cover functionality.
Pull Request resolved: #152708
Approved by: https://github.com/xmfan, https://github.com/eellison
and not torch._inductor.utils.use_triton_template(
    FixedLayout(torch.device("cuda"), torch.float16, [400, 800])
),
"Solo triton backend not possible",
)
Ugh, this is a bit indirect. Could we just do the same checks we do in test_max_autotune that check if we can run it? See is_big_gpu. I don't like reaching into implementation details when it doesn't add any benefit.
in1 = torch.randn((400, 600), device="cuda", dtype=torch.float16)
in2 = torch.randn((600, 800), device="cuda", dtype=torch.float16)
nit: make the tensors aligned so we won't do padding
prof1.export_chrome_trace(trace1)
prof2.export_chrome_trace(trace2)
What does this test?
# Indexing is based on `torch.cuda.get_device_name()`
# TODO investigate profiler support for tf32 and allow device to report correct number when it's turned on.
_device_mapping: dict[str, DeviceInfo] = {
Can we file an issue for this as a follow-up? It's not great that we are not doing this programmatically.
node_type = "in_out" | ||
else: | ||
node_type = "default" | ||
counted_flops = count_flops_fx(node) |
Can we make count_flops_fx a little bit smarter so that it does not try to run this unless the node has a registration in flop_counter? Currently we always instantiate the tensors and FakeTensorMode, and run.
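A sketch of that early exit (assuming the node targets an aten OpOverload; flop_registry is the table that register_flop_formula populates):

from torch.utils.flop_counter import flop_registry

def has_flop_formula(node) -> bool:
    # Only ops with a registered flop formula are worth the FakeTensor /
    # FlopCounterMode setup; everything else can bail out immediately.
    packet = getattr(node.target, "overloadpacket", None)
    return packet is not None and packet in flop_registry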
-@register_flop_formula([aten.convolution, aten._convolution])
-def conv_flop(x_shape, w_shape, _bias, _stride, _padding, _dilation, transposed, *args, out_shape=None, **kwargs) -> int:
+@register_flop_formula([aten.convolution, aten._convolution, aten.cudnn_convolution])
+def conv_flop(x_shape, w_shape, bias, stride, padding, dilation, transposed, *args, out_shape=None, **kwargs) -> int:
nit: reason why we needed to change the arg names?
if len(tmp) == 4:
    transposed = False
...
transposed = bool(tmp[6])
tmp[6] = transposed
If len(tmp) == 4, why are we indexing into tmp[6]?
    return flop_function(*args, **kwargs)


def _estimate_gb(event: dict[str, Any]) -> float:
When is this running? For the triton kernels, we should already have this, right?
I mentioned this in a review internally - if we're going to estimate gb for mms, we should use the mm data-access formula.
Generally - we have the gb from inductor, right? Also, inductor will dedupe inputs with the same buffer, so I don't know that this comment is accurate.
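For reference, a minimal sketch of the mm data-access formula (a lower bound on traffic that ignores cache reuse of tiles; names are illustrative):

def mm_gb(m: int, k: int, n: int, dtype_bytes: int = 2) -> float:
    # An (m, k) @ (k, n) matmul reads both operands and writes the output once:
    # (m*k + k*n + m*n) elements of dtype_bytes each.
    return (m * k + k * n + m * n) * dtype_bytes / 1e9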
if "Input Dims" not in event["args"] or "Concrete Inputs" not in event["args"]: | ||
breakpoint() |
breakpoint() in code - is this being tested?
Prereqs:
- #152708 (the estimate_flops / get_estimated_runtime refactor quoted above)

Features:
- New `DeviceInfo` class and new function `get_device_tflops`.
- `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
- `torch.profiler` logging added to `DebugAutotuner`.
- `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` compares two traces side by side.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov