Understanding the CUDA memory model and using it effectively is often key to achieving high performance on your NVIDIA GPU. The shuffle instruction, available on Kepler-class devices (compute capability 3.0 and newer), is a new tool that programmers can add to their bag of tricks to further optimize memory performance.
Figure 1 – CUDA Memory Model
The CUDA memory model, illustrated in Figure 1, consists of several different memory regions that are on and off chip, with varying scopes, latencies and bandwidths. The two fastest memory regions (lowest latency and highest bandwidth) are registers and shared memory. Registers are allocated by the compiler for each thread and therefore have the scope of a thread. Shared memory is explicitly defined by the programmer, allocated per block and has thread block scope. Since shared memory is visible to every thread in a block it is commonly used as a programmer controlled cache.
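To make the "programmer-controlled cache" idea concrete, here is a minimal sketch (the kernel name and block size are illustrative, not from the article): each block stages its values in shared memory, synchronizes, and then every thread reads back a value written by a different thread in the same block.

```cuda
// Minimal sketch: shared memory has thread block scope, so after the
// barrier any thread can read values that other threads in its block wrote.
__global__ void reverseBlock(float *d)
{
    __shared__ float s[64];      // allocated per block, visible to all its threads
    int t  = threadIdx.x;
    int tr = 63 - t;             // mirrored index within the block

    s[t] = d[t];                 // write to shared memory
    __syncthreads();             // synchronize the block
    d[t] = s[tr];                // read a value another thread wrote
}
```

Note the write/synchronize/read pattern here; the shuffle instruction discussed next collapses these three steps into one for data exchange within a warp.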
With the Kepler shuffle instruction we now have another way, in addition to shared memory, to share values between threads. So why would you want to use the shuffle instruction instead of shared memory? First, you can use shuffle to free up shared memory for other data or to increase your occupancy. Second, the shuffle instruction is faster than shared memory because it requires only one instruction versus three for shared memory (write, synchronize, read). Another potential performance advantage is that, relative to Fermi, shared memory bandwidth has doubled on Kepler devices while the number of compute cores has increased 6x; the shuffle instruction therefore provides another low-latency, high-bandwidth means of sharing data between threads and keeping the CUDA cores busy. Finally, the shuffle instruction is a safer replacement for warp-synchronous optimizations (removing __syncthreads() and relying on implicit synchronization within a warp).
The shuffle instruction allows a thread to read a value stored in a register of another thread within the same warp. A warp is a group of 32 threads with consecutive thread index values. The general shuffle instruction, __shfl, returns the value stored in a register of any other thread in the warp. The source thread is identified by its lane index (laneID), which is the index of a thread within its warp, calculated as thread index % 32.
float __shfl( float var,          // Variable you want to read from source thread
              int srcLane,        // laneID of the source thread
              int width=warpSize  // Division of warp into segments of size width
            );
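As a usage sketch (the kernel name and broadcast pattern are illustrative, not from the article): a warp-wide broadcast, where every lane reads the register value held by lane 0 of its warp, with no shared memory or __syncthreads() required. Note that on CUDA 9.0 and later the sync variant __shfl_sync(mask, var, srcLane) supersedes the Kepler-era __shfl shown in the text.

```cuda
// Each thread computes its laneID (thread index % 32) and then reads
// lane 0's copy of var -- a one-instruction, warp-wide broadcast.
__global__ void broadcast(float *out, const float *in)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int laneID = threadIdx.x % 32;   // lane index within the warp

    float var = in[tid];

    // All 32 lanes receive the value of var held by lane 0.
    var = __shfl(var, 0);

    out[tid] = var;
}
```

With the default width of warpSize, srcLane is interpreted relative to the whole warp; passing a smaller power-of-two width divides the warp into independent segments of that size.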