[Inductor] Fix the High Order Op layout issue #128275
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128275
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (3 Unrelated Failures)
As of commit 9d761e1 with merge base a3af32c.
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Fix the issue: #127995

- In the current implementation of creating a `FallbackKernel`, the device of the `NoneLayout` is set to `None` when the `example_output` returned from `cls.process_kernel` is `None`. The test reported in the issue is the case where the `ExternalKernel` returns `None`. https://github.com/pytorch/pytorch/blob/921aa194c77f5279b15415eaa213813ddcdb3b29/torch/_inductor/ir.py#L5632-L5649
- If an `ExternalKernel` scheduler node has a `None` device, the previous buffer will not be flushed before codegen of this `ExternalKernel` scheduler node, which causes wrong generated code. https://github.com/pytorch/pytorch/blob/ef2b5ed500cba0b8b2bf04e6006a0d64c910f440/torch/_inductor/scheduler.py#L2701-L2709

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
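To make the failure mode concrete, here is a hedged, self-contained toy. `NoneLayout` and the scheduler below are simplified stand-ins, not the real `torch/_inductor` classes, and the flush condition is paraphrased from the linked scheduler code:

```python
# Toy model of the bug and the fix; illustrative only, not Inductor code.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class NoneLayout:
    device: Optional[str]  # the real class carries a torch.device

@dataclass
class MiniScheduler:
    current_device: Optional[str] = None
    pending: list = field(default_factory=list)

    def codegen_extern(self, layout: NoneLayout):
        # The scheduler only flushes buffered kernels on a device change.
        # With layout.device == None the guard never fires, so buffers from
        # before the extern kernel leak past it -> wrong generated code.
        if layout.device is not None and layout.device != self.current_device:
            self.pending.clear()          # flush
            self.current_device = layout.device

sched = MiniScheduler(current_device="cuda:0", pending=["buf0"])
sched.codegen_extern(NoneLayout(None))    # before the fix: no device, no flush
print(sched.pending)                      # ['buf0'] -- stale buffer survives
sched.codegen_extern(NoneLayout("cpu"))   # after the fix: device propagated
print(sched.pending)                      # [] -- buffer flushed first
```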
looks good, one comment
```diff
@@ -198,6 +198,38 @@ def f(x):
         res = torch.compile(f, backend="inductor")(*inputs)
         self.assertTrue(torch.allclose(res, f(*inputs)))
+
+    @unittest.skipIf(IS_WINDOWS, "triton")
+    @unittest.skipIf(TEST_WITH_ROCM, "triton")
+    @unittest.skipIf(_get_torch_cuda_version() >= (11, 7), "triton")
```
This doesn't test CUDA. Why are we skipping here? Also, you can test directly: `from torch.utils._triton import has_triton`.
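A sketch of this suggestion, gating on triton availability directly instead of platform or CUDA-version proxies. `torch.utils._triton.has_triton` is the real helper; the class name here is a stand-in for the actual test class:

```python
import unittest
from torch.utils._triton import has_triton

class TestWithEffects(unittest.TestCase):  # class name assumed for illustration
    @unittest.skipIf(not has_triton(), "requires triton")
    def test_compile_inductor_external_op_return_none(self):
        ...  # real body lives in test/higher_order_ops/test_with_effects.py

if __name__ == "__main__":
    unittest.main()
```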
I think I need to skip this test on Windows; the pre-CI reports an error in `win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral)`:

```
2024-06-14T03:11:36.4507249Z torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
2024-06-14T03:11:36.4508166Z InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++')
```
I think you just need to skip on Windows.
Thanks, skipped this test on Windows.
```diff
         if example_output is None:
             packed = cls(
-                NoneLayout(None),
+                NoneLayout(device),
```
Now that we're propagating the device here, can we remove some of the `if device is not None` checks from the original PR? 493478d#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R2443
From the implementation of `FallbackKernel.find_device` (lines 5455 to 5471 in 685fcfb):
```python
def find_device(tensor_args, example_output):
    if tensor_args:
        devices = [arg.get_device() for arg in tensor_args if arg.get_device()]
        return devices[0]
    if isinstance(example_output, torch.Tensor):
        return example_output.device
    if isinstance(example_output, (list, tuple)):
        device_set = {FallbackKernel.find_device(None, x) for x in example_output}
        # Remove None
        devices = [device for device in device_set if device]
        if len(devices) == 1:
            return devices[0]
        for device in devices:
            if is_gpu(device.type):
                return device
        return devices[0]
    return None
```
So the device found here can still be `None`.
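A hedged illustration of that point, assuming `find_device` is callable as a static method, as the recursive call in the snippet suggests. Even with the device now threaded through, it returns `None` for ops with no tensor inputs and no tensor outputs, so `None`-tolerant guards remain useful:

```python
import torch
from torch._inductor.ir import FallbackKernel

print(FallbackKernel.find_device([], torch.ones(2)))  # cpu -- taken from example_output
print(FallbackKernel.find_device([], None))           # None -- nothing to infer from
```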
Fix the issue: #127995

- In the current implementation of creating a `FallbackKernel`, the device of the `NoneLayout` is set to `None` when the `example_output` returned from `cls.process_kernel` is `None`. https://github.com/pytorch/pytorch/blob/921aa194c77f5279b15415eaa213813ddcdb3b29/torch/_inductor/ir.py#L5632-L5649
- If an `ExternalKernel` scheduler node has a `None` device, the previous buffer will not be flushed before codegen of this `ExternalKernel` scheduler node, which causes wrong generated code. https://github.com/pytorch/pytorch/blob/ef2b5ed500cba0b8b2bf04e6006a0d64c910f440/torch/_inductor/scheduler.py#L2701-L2709

**Test Plan**
```
python -u -m pytest -s -v test/higher_order_ops/test_with_effects.py -k test_compile_inductor_external_op_return_none
```
@leslie-fang-intel Yes! It is fine to land this PR!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@leslie-fang-intel thank you for the fix. I'll submit this for cherry-pick onto the 2.4 branch.
Pull Request resolved: #128275
Approved by: https://github.com/eellison