A Simple Trick To Pass Constant Arguments Into GPU Kernels

Most CUDA developers are familiar with the usual methods of passing constant arguments into GPU kernels.  The simplest is to pass them directly as kernel parameters; the other common option is to copy them to constant memory.  Under certain circumstances, though, there's another, lesser-known way to get constants into your GPU kernel that may even improve kernel performance!

The following code takes in an array of N floats and the constants N and M, and outputs each element raised to the Mth power via a for loop over M.  Each element is handled in parallel by a single CUDA thread.  The loop iteration count M is passed in directly as a kernel parameter.

__global__
void Kernel1(float* __restrict__ const input,
             float* __restrict__       output,
             int                 const N,
             int                 const M)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;   // guard: valid indices are 0..N-1

    float val = input[idx];
    float res = 1.0f;

    #pragma unroll
    for (int i = 0; i < M; i++)
        res *= val;

    output[idx] = res;
}
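
For reference, launching Kernel1 just means passing M like any other argument.  A minimal sketch, where the grid/block sizes are assumptions and the pointer names mirror the dispatch code shown later:

// M is passed directly as a kernel parameter.
const int threads = 256;
const int blocks  = (N + threads - 1) / threads;
Kernel1<<<blocks, threads>>>(d_input, d_output, N, M);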

 

This second kernel does the same thing as the first, except that it uses cudaMemcpyToSymbol to store M in constant GPU memory:

 

__constant__ int c_M;

__global__
void Kernel2(float* __restrict__ const input,
             float* __restrict__       output,
             int                 const N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;   // guard: valid indices are 0..N-1

    float val = input[idx];
    float res = 1.0f;

    #pragma unroll
    for (int i = 0; i < c_M; i++)
        res *= val;

    output[idx] = res;
}
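
Here the host copies M into c_M once before the launch.  A sketch of what that host code might look like (error checking omitted; the launch configuration is an assumption):

// Copy M into the __constant__ variable c_M, then launch Kernel2.
cudaMemcpyToSymbol(c_M, &M, sizeof(int));

const int threads = 256;
const int blocks  = (N + threads - 1) / threads;
Kernel2<<<blocks, threads>>>(d_input, d_output, N);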

 

In both examples, the compiler cannot determine the value of M at compile time, and therefore can only guess at the number of times to unroll the loop, emitting extra code to handle other values of M.  However, if M were passed in via a template argument, it would be known at compile time, which would allow the loop to be unrolled optimally:

template<int t_M>
__global__
void Kernel3(float* __restrict__ const input,
             float* __restrict__       output,
             int                 const N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;   // guard: valid indices are 0..N-1

    float val = input[idx];
    float res = 1.0f;

    #pragma unroll
    for (int i = 0; i < t_M; i++)
        res *= val;

    output[idx] = res;
}

This, however, adds the constraint that the constant value M be limited to a known range of values, so that the appropriate instantiation of the kernel can be selected at runtime.

In the example code, the runtime dispatch looks something like this:

    switch (M)
    {
    case 0: LaunchKernel3_M<0>(d_input, d_output, N); break;
    case 1: LaunchKernel3_M<1>(d_input, d_output, N); break;
    case 2: LaunchKernel3_M<2>(d_input, d_output, N); break;
    case 3: LaunchKernel3_M<3>(d_input, d_output, N); break;
    case 4: LaunchKernel3_M<4>(d_input, d_output, N); break;
    case 5: LaunchKernel3_M<5>(d_input, d_output, N); break;
    case 6: LaunchKernel3_M<6>(d_input, d_output, N); break;
    case 7: LaunchKernel3_M<7>(d_input, d_output, N); break;
    case 8: LaunchKernel3_M<8>(d_input, d_output, N); break;
    case 9: LaunchKernel3_M<9>(d_input, d_output, N); break;
    
    default:
        printf("Unsupported value of M\n"); break;
    }
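
LaunchKernel3_M is not shown in the excerpt above; presumably it is a thin wrapper that instantiates and launches Kernel3 with the chosen compile-time constant.  A sketch under that assumption (grid/block sizes again assumed):

// Hypothetical wrapper: forwards the compile-time constant t_M to Kernel3.
template<int t_M>
void LaunchKernel3_M(float* d_input, float* d_output, int N)
{
    const int threads = 256;
    const int blocks  = (N + threads - 1) / threads;
    Kernel3<t_M><<<blocks, threads>>>(d_input, d_output, N);
}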

Here is a more detailed timing breakdown for various values of M, collected on an NVIDIA P100, with all runs on one billion elements.

 

Kernel 1 and Kernel 2 times are nearly identical, which is expected, because kernel parameters are themselves passed to the device via constant memory (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-parameters).  The templated kernel, whose loop the compiler can unroll fully, outperforms the other two.

Of course, it's not always possible to use template arguments, and compilation costs increase because a separate kernel is generated for each value.  The number of instantiations can grow dramatically when multiple constants are passed in via template parameters, as sketched below.  However, when performance is key, the payoff may well be worth it.
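
To illustrate how quickly the instantiation count can grow, a hypothetical kernel (not part of the original example) that took two compile-time constants would need one instantiation per combination of values:

// Hypothetical kernel with two compile-time constants.  Supporting M and K
// in [0, 9] already requires a 10 x 10 dispatch, i.e. 100 instantiations.
template<int t_M, int t_K>
__global__
void Kernel4(float* __restrict__ const input,
             float* __restrict__       output,
             int                 const N);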

The code files are available to download.