### Bug description
When trying a simple example using the new CustomOpLibrary, the graph compiler appears to keep using a stale cache despite code changes in the kernel. Below is an example where changing one line of kernel code and re-running does not trigger a recompile. If the cache is deleted manually, you can see the behaviour change.
### Steps to reproduce
- Run the simple example below and notice that it completes successfully.
- Change one line of code that should make it fail, and notice that it doesn't.
- Manually clean the cache: `rm -rf ~/.modular`
- Run again and observe the different behaviour (see the cache-clearing sketch after this list).
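
For convenience, here is a minimal sketch of clearing the cache programmatically between runs. It only assumes what the repro above states, namely that the cache lives under `~/.modular`:

```python
# Sketch: clear the MAX compilation cache between runs so that kernel
# edits are actually picked up. Assumes the cache directory is
# ~/.modular, as in the manual `rm -rf` step above.
import shutil
from pathlib import Path

cache_dir = Path.home() / ".modular"
if cache_dir.exists():
    shutil.rmtree(cache_dir)
```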
`example.py`:

```python
from pathlib import Path

import torch
from max.torch import CustomOpLibrary

# Load the Mojo custom ops from the ./kernels directory.
mojo_kernels = Path(__file__).parent / "kernels"
op_library = CustomOpLibrary(mojo_kernels)
add_const_kernel = op_library.add_const


def add_const_1d(x: torch.Tensor) -> torch.Tensor:
    # The op writes into the pre-allocated `result` tensor.
    result = torch.zeros_like(x, dtype=x.dtype, device=x.device)
    add_const_kernel(result, x)
    return result


if __name__ == "__main__":
    x = torch.randn(10).cuda()
    print(add_const_1d(x))
```
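
Note the destination-passing convention here: the op is invoked as `add_const_kernel(result, x)` and writes into the pre-allocated `result` tensor rather than returning a new one, matching the `OutputTensor`/`InputTensor` signature of the Mojo op below.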
`kernels/kernel.mojo`:

```mojo
import compiler
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor, UNKNOWN_VALUE
from runtime.asyncrt import DeviceContextPtr
from math import ceildiv
from gpu.host import DeviceBuffer
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer

alias BLOCK_SIZE = 32
alias Dyn1DLayout = Layout.row_major(10)
alias dtype = DType.float32


@compiler.register("add_const")
struct AddConst:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        # Outputs
        result: OutputTensor[type = DType.float32, rank=1],
        # Inputs
        x: InputTensor[type = DType.float32, rank=1],
        # Context
        ctx: DeviceContextPtr,
    ) raises:
        x_tensor = x.to_layout_tensor()
        result_tensor = result.to_layout_tensor()

        @parameter
        if target == "cpu":
            raise Error("add_const CPU target not implemented yet.")
        elif target == "gpu":
            # Get GPU context
            var gpu_ctx = ctx.get_device_context()
            # Define grid and block dimensions for the kernel launch
            var grid = (ceildiv(x.dim_size(0), BLOCK_SIZE))
            var block = (BLOCK_SIZE)
            # Zero the output buffer before launching the kernel.
            gpu_ctx.enqueue_memset(
                DeviceBuffer[result.type](
                    gpu_ctx,
                    rebind[UnsafePointer[Scalar[result.type]]](
                        result_tensor.ptr
                    ),
                    x.dim_size(0),
                    owning=False,
                ),
                0,
            )
            gpu_ctx.enqueue_function[add_const_kernel](
                x_tensor,
                result_tensor,
                x.dim_size(0),
                grid_dim=grid,
                block_dim=block,
            )
        else:
            raise Error("Unsupported target:", target)


fn add_const_kernel(
    x: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
    result: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
    size: Int,
):
    # One thread per element; guard against out-of-range threads.
    i = block_idx.x * block_dim.x + thread_idx.x
    if i < size:
        result[i] = x[i] + 10
```
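
Note that `Dyn1DLayout` is the crux of the repro: `Layout.row_major(10)` makes the layout fully static, whereas swapping in `UNKNOWN_VALUE` (already imported above) makes the leading dimension dynamic, and that one-line change is exactly what the cache fails to pick up.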
Run this and it should work, printing a tensor. Then replace this line in the kernel:

```diff
- alias Dyn1DLayout = Layout.row_major(10)
+ alias Dyn1DLayout = Layout.row_major(UNKNOWN_VALUE)
```

Run again and it will continue to work. But if you remove the cache directory with `rm -rf ~/.modular`, you will see that it now fails to run.
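
As a quick sanity check of the op's numeric behaviour, here is a sketch assuming `example.py` above is importable from the working directory:

```python
# Sketch: sanity-check the op against a plain PyTorch reference.
# `add_const_1d` comes from example.py above; the kernel adds 10 to
# every element, so this should hold after a successful (re)compile,
# while a failed compile surfaces as an exception at call time.
import torch

from example import add_const_1d

x = torch.randn(10).cuda()
out = add_const_1d(x)
assert torch.allclose(out, x + 10), "kernel output does not match x + 10"
print("ok:", out)
```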
### System information
- Provide the system information by running `magic info`. I am not running with magic but with uv, so I cannot do that.

Here is the `pyproject.toml` instead:
```toml
[project]
name = "example"
version = "0.0.0"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.6.0",
    "pillow>=11.2.1, <12",
    "modular>=25.4.0.dev2025052405",
]

[tool.uv]

[[tool.uv.index]]
url = "https://dl.modular.com/public/nightly/python/simple/"
```
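
With this configuration, the repro should be runnable with `uv run python example.py` (a CUDA-capable GPU is assumed, since the example calls `.cuda()`).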