Using Shared Memory on NVIDIA GPUs

We received a good question on one of our webinars. 

There are a couple of things happening with shared memory and its size. As the programmer, you declare the size of your shared memory allocation, either statically (at compile time) or dynamically (at run time). If you statically declare a shared memory allocation that exceeds the limit per thread block (48 KB), the NVIDIA compiler generates an error and your program will not compile. If you dynamically attempt to allocate more shared memory than is available, your GPU kernel will not launch and an error code will be returned to your program, though you must check for error codes programmatically. You can also combine multiple static allocations with one dynamic allocation; if the sum of the allocations exceeds the shared memory limit per thread block, the GPU kernel will not launch.
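As a rough sketch of the two declaration styles and the launch-time error check described above (the kernel name and sizes here are illustrative, not from the webinar):

```cuda
#include <cstdio>

// Illustrative kernel showing both allocation styles.
__global__ void myKernel(float *out)
{
    // Static allocation: size is fixed at compile time.
    // Exceeding the per-block limit here is a compile-time error.
    __shared__ float staticTile[256];

    // Dynamic allocation: size is supplied at launch time via the
    // third <<<...>>> configuration argument.
    extern __shared__ float dynamicTile[];

    int i = threadIdx.x;
    staticTile[i]  = (float)i;
    dynamicTile[i] = staticTile[i] * 2.0f;
    __syncthreads();
    out[i] = dynamicTile[i];
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));

    // Request 256 floats of dynamic shared memory at launch time.
    size_t dynBytes = 256 * sizeof(float);
    myKernel<<<1, 256, dynBytes>>>(d_out);

    // If the combined static + dynamic request exceeds the per-block
    // limit, the kernel silently fails to launch -- you must check
    // the error code yourself.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

Note that the error surfaces only through `cudaGetLastError()` (or the return value of a subsequent runtime call); the launch itself does not throw or abort.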

Finally, shared memory is allocated exclusively to a thread block to facilitate zero-overhead context switching. As a result, the total amount of shared memory on the SM limits the number of thread blocks that can be scheduled to run concurrently on that SM, which affects occupancy and, potentially, performance. For more details on occupancy and performance, watch our optimization webinar.
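You can see this effect programmatically with the CUDA occupancy API: the sketch below (block size and shared memory amounts are illustrative) asks the runtime how many blocks of a kernel can be resident per SM for two different dynamic shared memory requests. The larger request yields fewer concurrent blocks.

```cuda
#include <cstdio>

__global__ void myKernel(float *out)
{
    extern __shared__ float tile[];
    out[threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    const int blockSize = 256;
    int blocksSmall, blocksLarge;

    // Max resident blocks per SM with 1 KB of dynamic shared memory...
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksSmall, myKernel, blockSize, 1024);

    // ...versus 32 KB per block, which leaves room for far fewer blocks.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksLarge, myKernel, blockSize, 32 * 1024);

    printf("blocks/SM at 1 KB:  %d\n", blocksSmall);
    printf("blocks/SM at 32 KB: %d\n", blocksLarge);
    return 0;
}
```

The exact numbers depend on the GPU's shared memory capacity per SM, but the trend is what matters: the more shared memory each block reserves, the fewer blocks fit on an SM, and the lower the achievable occupancy.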