Kepler’s Shuffle Instruction

Understanding the CUDA memory model and utilizing it effectively is often key in achieving high performance from your NVIDIA GPU. The shuffle instruction, available on Kepler devices (compute 3.0 and newer), is a new tool that programmers can add to their bags of tricks to further optimize memory performance.

Figure 1 – The CUDA Memory Model

The CUDA memory model, illustrated in Figure 1, consists of several different memory regions that are on and off chip, with varying scopes, latencies and bandwidths. The two fastest memory regions (lowest latency and highest bandwidth) are registers and shared memory. Registers are allocated by the compiler for each thread and therefore have the scope of a thread. Shared memory is explicitly defined by the programmer, allocated per block and has thread block scope. Since shared memory is visible to every thread in a block it is commonly used as a programmer controlled cache.

With the Kepler shuffle instruction we now have another way, in addition to shared memory, to share values between threads. So why would you want to use the shuffle instruction instead of shared memory? First, the shuffle instruction frees up shared memory to be used for other data or to increase your occupancy. Second, the shuffle instruction is faster than shared memory, since it requires only one instruction versus three for shared memory (write, synchronize, read). Another potential performance advantage is that, relative to Fermi, shared memory bandwidth has doubled on Kepler devices while the number of compute cores has increased by 6x; the shuffle instruction therefore provides another low-latency, high-bandwidth means of sharing data between threads and keeping the CUDA cores busy. Finally, the shuffle instruction can replace warp-synchronous optimizations (which rely on removing __syncthreads()).

The shuffle instruction allows a thread to read a value stored in a register of another thread within the same warp. A warp is a group of 32 threads with consecutive thread index values. The general shuffle instruction, __shfl, returns the value stored in a register of any other thread in the warp. The source thread is identified by its lane index (laneID), which is the index of a thread within a warp, calculated as thread index % 32.

float __shfl(
     float var,           // Variable you want to read from source thread
     int srcLane,         // laneID of the source thread
     int width=warpSize   // Division of warp into segments of size width
);

All shuffle instructions are overloaded to read either float or integer variables. Each thread participating in the shuffle instruction can specify a different value of srcLane, allowing each thread to read another thread's var value. If every thread passes the same constant srcLane, that lane's var is broadcast to all threads in the warp, as illustrated in Figure 2.

Figure 2 – Shuffle instruction with constant srcLane broadcasts the value in a register from one thread to all threads in a warp

There are also three variations of the shuffle instruction that implement common sharing patterns. The __shfl_up() and __shfl_down() instructions return the value from a source thread whose laneID is lower or higher by the delta argument. If laneID - delta < 0 for the __shfl_up() instruction, the thread's own value of var is returned; the same is true for the __shfl_down() instruction when laneID + delta > 31. Finally, a butterfly pattern (a bitwise XOR of a thread's laneID) can be achieved with the __shfl_xor() instruction. Figure 3 illustrates the patterns for shuffle up and down.

float __shfl_[down/up](
     float var,           // Variable to read from source thread
     unsigned int delta,  // # of threads you want to shuffle down/up
     int width=warpSize   // Division of warp into segments of size width
);
Figure 3 – Shuffle up and down instructions illustrated

Let’s look at how we can take advantage of the shuffle instruction for the moving average filter algorithm defined as:

y[i] = sum_{j=-2..2} w[j+2] * x[i+j],   for 2 <= i <= N-3

where y[0] = y[1] = y[N-2] = y[N-1] = 0 and N is the number of elements in the input, x, and output, y, arrays. I covered this algorithm as a coding example during my second GTC tutorial using shared memory, and you can download the code here.

The shared memory strategy is to bring into shared memory all the input values needed to calculate the output for a given block: the values are read once from global memory, and then every thread can read the 5 values it needs from shared memory, since shared memory is visible to all threads in a block. A similar strategy works with the shuffle instruction, which lets us read values from threads within a warp. The first step is therefore to bring all the values required for a warp into registers; the shuffle instruction is then used to access the needed values from registers instead of global memory. Figure 4 illustrates the output values a warp is responsible for calculating and the corresponding 36 input values required to perform the moving average calculation.

Figure 4 – Input values required to calculate the output values for a warp

To load these 36 values, we start with each thread reading one value from the input array and storing it in a register, input0.

// Global thread index
int gIdx = threadIdx.x + blockIdx.x * blockDim.x;
    
// Get data from global memory
float input0 = input[gIdx];

This results in 32 values in registers, but we are still missing 4 values. We choose 4 threads to read them: the first two and the last two threads of the warp fetch the values at the lower (green) and higher (blue) index values, respectively.

// laneID is a thread's index within a warp
int laneID = threadIdx.x & 0x1f;

float inputm2 = 0;
float inputp2 = 0;
 
// Get the first two threads to read the
// preceding two values
if(gIdx - 2 >= 0 && laneID < 2)
     inputm2 = input[gIdx - 2];

// Get the last two threads to read the
// following two values
if(gIdx + 2  < N && laneID > 29)
     inputp2 = input[gIdx + 2];

We now have, in registers, all 36 input values needed to calculate the 32 output values for a warp. To show how this code works, let's use an input array where input[idx] = idx, a block size of 512, and look at the first warp in the second block. Table 1 lists the values in the registers for the threads within the warp.

Table 1 – Values in registers for blockDim.x = 512. A dash indicates that the thread did not execute that statement.

threadIdx.x                0    1    2    3  ...   28   29   30   31
blockIdx.x                 1    1    1    1  ...    1    1    1    1
gIdx                     512  513  514  515  ...  540  541  542  543
laneID                     0    1    2    3  ...   28   29   30   31
inputm2 = input[gIdx-2]  510  511    -    -  ...    -    -    -    -
input0  = input[gIdx]    512  513  514  515  ...  540  541  542  543
inputp2 = input[gIdx+2]    -    -    -    -  ...    -    -  544  545

We now have all 36 values, from 510 to 545, in registers within this warp, and we're ready for the shuffle instruction! We'll start with the shuffle up instruction to get the values at input[gIdx-1] and input[gIdx-2]. We have to handle the cases of threads 0 and 1, for which the shuffle up instruction does not return the correct value. For input[gIdx-2], threads 0 and 1 already read that value from global memory in the previous segment of code, so we simply add an if condition so that those threads do not use the value returned from the shuffle instruction. For the input[gIdx-1] value for thread 0, we note from Table 1 that the desired value is in the variable inputm2 of thread 1; this is where we use the __shfl() instruction to access that exact value. Putting this all together in code, we end up with:

// Using shuffle up, get input[i-1]
float inputm1 = __shfl_up(input0, 1);

// The thread with laneID == 0 did not
// get the value at input[i-1] from
// the shuffle instruction above, so
// read it from thread 1's inputm2
float temp = __shfl(inputm2, 1);
if(laneID == 0)
     inputm1 = temp;

// Using shuffle up, get input[i-2]
// for threads with laneID > 1
temp = __shfl_up(input0, 2);
if(laneID > 1)
     inputm2 = temp;

We repeat the same strategy for the input[idx+1] and input[idx+2] variables using the __shfl_down() instruction. Table 3 lists the instructions used to get the different values into the appropriate variables.

Table 3 – Values produced by each load and shuffle instruction (same setup as Table 1). A dash indicates that the thread does not use that instruction's result.

gIdx                              512  513  514  515  ...  540  541  542  543
laneID = threadIdx.x & 31           0    1    2    3  ...   28   29   30   31
inputm2 = input[gIdx-2]           510  511    -    -  ...    -    -    -    -
inputm1 = __shfl(inputm2,1)       511    -    -    -  ...    -    -    -    -
inputm2 = __shfl_up(input0,2)       -    -  512  513  ...  538  539  540  541
inputm1 = __shfl_up(input0,1)       -  512  513  514  ...  539  540  541  542
input0  = input[gIdx]             512  513  514  515  ...  540  541  542  543
inputp1 = __shfl_down(input0,1)   513  514  515  516  ...  541  542  543    -
inputp2 = __shfl_down(input0,2)   514  515  516  517  ...  542  543    -    -
inputp1 = __shfl(inputp2,30)        -    -    -    -  ...    -    -    -  544
inputp2 = input[gIdx+2]             -    -    -    -  ...    -    -  544  545

Now that we have all the values in registers for each thread, the output calculation is performed using register values. The full code sample can be downloaded here.

// Use values in registers to calculate output
if(gIdx > 1 && gIdx < (N - 2))
{
  output[gIdx]   = cWeights[0] * inputm2
                 + cWeights[1] * inputm1
                 + cWeights[2] * input0
                 + cWeights[3] * inputp1
                 + cWeights[4] * inputp2;
}
else if (gIdx < N)
{
  output[gIdx] = 0;
}

A couple of final comments about the shuffle instruction. You might have noticed (note the sarcasm) that this use of the shuffle instruction results in code that is harder to read, understand, debug and maintain. It is therefore reasonable to ask: is the more complicated code worth the performance gain? Depending on your application and performance requirements, you may answer this question differently. For this particular application, the benefit gained from the shuffle instruction is minimal to none compared to the shared memory implementation. While the shuffle instruction is faster than shared memory, shared memory has the benefit of sharing values across an entire block (which should be larger than a warp) and therefore reduces the number of redundant global memory accesses. So while I've demonstrated the use of the shuffle instruction on the moving average filter (and it was fun to code and a good learning experience), I recommend turning to NVIDIA's CUDA Programming Guide for more appropriate uses of the shuffle instruction.