Why Launching A Numba Cuda Kernel Works With Up To 640 Threads, But Fails With 641 When There's Plenty Of GPU Memory Free?

I have a Numba CUDA kernel which I can launch with up to 640 threads and 64 blocks on an RTX 3090. If I attempt to use 641 threads, it fails with: Traceback (most recent call last)

Solution 1:

This is usually a registers-per-thread issue (CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES). It is covered in many questions here on SO under the cuda tag, such as this one; there are many others as well, such as here. In short, the total number of registers used per threadblock cannot exceed the limit for your GPU (see below). The total registers used per threadblock is approximately the number of registers per thread times the number of threads per block (potentially rounded up for allocation granularity).
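The arithmetic above can be sketched numerically. This is illustrative only: 65536 registers per SM is the published figure for an RTX 3090 (compute capability 8.6), and 102 registers per thread is an assumed compile result chosen to reproduce the 640/641 boundary from the question, not a measured value. Registers are allocated per warp, which is where the rounding granularity comes from.

```python
# Illustrative register arithmetic for a CUDA kernel launch.
# Assumptions: 65536 registers per SM (RTX 3090, compute 8.6) and
# 102 registers per thread (hypothetical compiled register count).

REGS_PER_SM = 65536
WARP_SIZE = 32
regs_per_thread = 102  # assumed, for illustration

def block_fits(threads_per_block):
    """Return True if the block's register footprint fits in one SM,
    allocating registers per whole warp (hence the ceiling division)."""
    warps = -(-threads_per_block // WARP_SIZE)  # ceil(threads / 32)
    return warps * WARP_SIZE * regs_per_thread <= REGS_PER_SM

print(block_fits(640))  # True:  20 warps * 32 * 102 = 65280 <= 65536
print(block_fits(641))  # False: 21 warps * 32 * 102 = 68544 >  65536
```

With these assumed numbers, 640 threads (20 full warps) just fit, while 641 threads force allocation of a 21st warp and push the block over the per-SM register limit, matching the behaviour in the question.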

The principal method to address this issue in numba cuda is to include a maximum register usage parameter in your cuda.jit decorator:

@cuda.jit(max_registers=40)

You can of course set that to other values. A simple heuristic is to divide the total number of registers per SM (or per thread block, if that limit is lower) by the total number of threads per block you wish to launch; both register limits are discoverable with the CUDA deviceQuery sample code or in table 15 of the programming guide. So if your GPU's SM has 64K registers and you want to launch 1024 threads per block, you would choose a maximum of 64 registers per thread. That number should work for the RTX 3090.
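The heuristic is just an integer division. A minimal sketch, assuming the 65536 registers per SM of an RTX 3090 and a desired block size of 1024 threads:

```python
# Heuristic for choosing max_registers: registers per SM divided by
# the desired threads per block.
REGS_PER_SM = 65536       # RTX 3090 (compute capability 8.6), per deviceQuery
threads_per_block = 1024  # desired launch configuration

max_registers = REGS_PER_SM // threads_per_block
print(max_registers)  # 64
```

The resulting value would then be passed as `max_registers=64` to the `cuda.jit` decorator, constraining the compiler so the full 1024-thread block can be resident.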
