Opt-In L1 Caching of Global Loads on Some Kepler/Maxwell GPUs


CUDA developers generally strive for coalesced global memory accesses and/or explicit ‘caching’ of global data in shared memory.  However, sometimes algorithms have memory access patterns that cannot be coalesced, and that are not a good fit for shared memory.  Fermi GPUs have an automatic L1 cache on each streaming multiprocessor (SM) that can be beneficial for these problematic global memory access patterns.  First-generation Kepler GPUs have an automatic L1 cache on each SM, but it only caches local memory accesses.  In these GPUs, the lack of automatic L1 cache for global memory is partially offset by the introduction of a separate 48 KB read-only (née texture) cache per SM.

Opt-In L1 Caching on Kepler GPUs

NVIDIA quietly re-enabled L1 caching of global memory on GPUs based on the GK110B, GK20A, and GK210 chips.  The Tesla K40 (GK110B), Tesla K80 (GK210), and Tegra K1 (GK20A) all support this feature.  You can query at run time whether a GPU supports caching global memory operations by calling cudaGetDeviceProperties and examining the globalL1CacheSupported property.  Checking the Compute Capability alone is not sufficient: the Tesla K20/K20X and the Tesla K40 all report Compute Capability 3.5, but only the K40 supports caching global memory in L1.


Table 1 - cudaGetDeviceProperties Output
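| Property               | Tesla K10 | Tesla K20 | Tesla K40 | Tesla K80 |
|------------------------|-----------|-----------|-----------|-----------|
| Compute Capability     | 3.0       | 3.5       | 3.5       | 3.7       |
| globalL1CacheSupported | 0         | 0         | 1         | 1         |
| localL1CacheSupported  | 1         | 1         | 1         | 1         |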
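
The properties in Table 1 can be read with a short host program. The following is a minimal sketch, assuming a CUDA toolkit recent enough to expose the globalL1CacheSupported and localL1CacheSupported fields of cudaDeviceProp; the output formatting is illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties(%d) failed: %s\n",
                         dev, cudaGetErrorString(err));
            continue;
        }
        // Compute Capability alone does not tell you whether global loads
        // can be cached in L1, so print the dedicated properties as well.
        std::printf("Device %d: %s (Compute Capability %d.%d)\n",
                    dev, prop.name, prop.major, prop.minor);
        std::printf("  globalL1CacheSupported = %d\n",
                    prop.globalL1CacheSupported);
        std::printf("  localL1CacheSupported  = %d\n",
                    prop.localL1CacheSupported);
    }
    return 0;
}
```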


Even on Kepler GPUs that do support caching global memory in L1, the default behavior is to not cache global memory in L1. You have to opt in to caching by passing -Xptxas="-dlcm=ca" as an argument to NVCC when compiling your kernels. Kernels compiled in this fashion will still function on earlier Kepler devices, though global memory operations will bypass the L1 cache.
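
As an illustration of the opt-in, here is a small gather test with the two build commands shown as comments. This is only a sketch: the file name gather.cu, the kernel, the index pattern, and the choice of -arch=sm_35 (a K40-class target) are assumptions made for the example, and the code does not measure performance by itself.

```cuda
// gather.cu (hypothetical file name)
//
// Default build, global loads are not cached in L1 on Kepler:
//   nvcc -arch=sm_35 -o gather gather.cu
// Opt-in build, global loads may be cached in L1 on supporting GPUs:
//   nvcc -arch=sm_35 -Xptxas="-dlcm=ca" -o gather_l1 gather.cu
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Data-dependent gather: neighboring threads read scattered addresses,
// so the loads from `in` cannot be coalesced.
__global__ void gather(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> hIn(n), hOut(n);
    std::vector<int> hIdx(n);
    for (int i = 0; i < n; ++i) {
        hIn[i] = static_cast<float>(i);
        // 33 is odd and n is a power of two, so this is a permutation of
        // 0..n-1; neighboring threads read floats 33 elements apart.
        hIdx[i] = (i * 33) % n;
    }

    float *dIn = 0, *dOut = 0;
    int *dIdx = 0;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    gather<<<(n + 255) / 256, 256>>>(dIn, dIdx, dOut, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // The result must be identical with or without the opt-in flag.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != hIn[hIdx[i]]) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dIdx);
    return mismatches == 0 ? 0 : 1;
}
```

Building the same source both ways and timing the two binaries under a profiler is one way to check whether the opt-in helps a particular access pattern.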

If you opt in to L1 caching, the SM moves an entire 128 B cache line from L2 or device memory on a cache miss. For some access patterns this is less efficient than bypassing the cache, where the minimum transaction size drops to 32 B. For example, if the 32 threads of a warp each read one 4-byte word from a different cache line and every access misses, caching moves 32 × 128 B = 4 KB for 128 B of useful data, while bypassing moves 32 × 32 B = 1 KB. The likelihood of a cache hit depends on SM occupancy, the working set size, the runtime scheduling of warps, and the configuration of L1/shared memory on the SM. You can programmatically control the split between L1 and shared memory via the cudaDeviceSetCacheConfig and cudaFuncSetCacheConfig API calls (Table 2). You may need to resort to trial and error to determine the optimal caching behavior and resource configuration for your kernel(s).


Table 2 - L1 Cache/Shared Memory Configuration Options on Compute 3.x Devices
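| Configuration            | L1 Memory Size (KB) | Compute 3.0-3.5 Shared Memory Size (KB) | Compute 3.7 Shared Memory Size (KB) |
|--------------------------|---------------------|-----------------------------------------|-------------------------------------|
| cudaFuncCachePreferShared | 16                 | 48                                      | 112                                 |
| cudaFuncCachePreferL1     | 48                 | 16                                      | 80                                  |
| cudaFuncCachePreferEqual  | 32                 | 32                                      | 96                                  |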
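
The configurations in Table 2 are requested through the runtime API. The sketch below sets a device-wide preference and then a per-kernel preference; the kernel scale and the helper configName are hypothetical names introduced for this example. Both calls express a preference only, and the runtime may choose a different configuration if the kernel needs it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to configure and launch.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: human-readable name for a cudaFuncCache value.
static const char* configName(cudaFuncCache c)
{
    switch (c) {
        case cudaFuncCachePreferNone:   return "cudaFuncCachePreferNone";
        case cudaFuncCachePreferShared: return "cudaFuncCachePreferShared";
        case cudaFuncCachePreferL1:     return "cudaFuncCachePreferL1";
        case cudaFuncCachePreferEqual:  return "cudaFuncCachePreferEqual";
        default:                        return "unknown";
    }
}

int main()
{
    // Device-wide preference, used by kernels without their own setting.
    cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferEqual);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceSetCacheConfig: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Per-kernel preference; this takes precedence over the device-wide one.
    err = cudaFuncSetCacheConfig(scale, cudaFuncCachePreferL1);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncSetCacheConfig: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    cudaFuncCache current;
    if (cudaDeviceGetCacheConfig(&current) == cudaSuccess)
        std::printf("device-wide preference: %s\n", configName(current));

    const int n = 1 << 16;
    float* d = 0;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    cudaFree(d);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```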


Opt-In on Maxwell GPUs

In first-generation Maxwell GPUs (Compute Capability 5.0), the functionality of the L1 cache and texture/read-only cache is combined into a single unit, with a separate dedicated 64 KB shared memory unit. The unified L1/texture cache serves as a coalescing buffer for global memory accesses, but not as a cache. Read-only global memory accesses may be cached with appropriate const __restrict__ pointer qualifiers, or via the __ldg() intrinsic. Local memory operations are only cached in L2.
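
The two ways of requesting read-only caching mentioned above can be sketched as follows. The kernel names and the gather pattern are assumptions made for this example. Note that const __restrict__ is a hint the compiler may or may not act on, while __ldg() is explicit but only available when compiling for Compute Capability 3.5 or later, hence the guard.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Variant 1: const + __restrict__ tells the compiler the inputs are
// read-only and not aliased, so it may use the read-only data path.
__global__ void gatherRestrict(const float* __restrict__ in,
                               const int* __restrict__ idx,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Variant 2: __ldg() explicitly requests a load through the read-only cache.
__global__ void gatherLdg(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
    out[i] = __ldg(&in[__ldg(&idx[i])]);
#else
    out[i] = in[idx[i]];  // plain loads for targets without __ldg()
#endif
}

int main()
{
    const int n = 1 << 18;
    std::vector<float> hIn(n), hA(n), hB(n);
    std::vector<int> hIdx(n);
    for (int i = 0; i < n; ++i) {
        hIn[i] = static_cast<float>(i);
        hIdx[i] = (i * 33) % n;  // scattered but valid indices
    }

    float *dIn = 0, *dA = 0, *dB = 0;
    int *dIdx = 0;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int block = 256, grid = (n + block - 1) / block;
    gatherRestrict<<<grid, block>>>(dIn, dIdx, dA, n);
    gatherLdg<<<grid, block>>>(dIn, dIdx, dB, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hA.data(), dA, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hB.data(), dB, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Both variants must produce the same values as a host-side gather.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hA[i] != hIn[hIdx[i]] || hB[i] != hIn[hIdx[i]]) ++mismatches;
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(dIn); cudaFree(dA); cudaFree(dB); cudaFree(dIdx);
    return mismatches == 0 ? 0 : 1;
}
```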

With second-generation Maxwell GPUs (Compute Capability 5.2), you can again opt in to global memory caching in the unified L1/texture cache. This is done through the -Xptxas="-dlcm=ca" argument to NVCC. Note that even though you can opt in to global memory caching on Compute 5.2 devices, the globalL1CacheSupported property returns 0.
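
Because the property reports 0 on Compute 5.2, a runtime check that relies on globalL1CacheSupported alone would miss these devices. One possible workaround is sketched below; the helper name globalL1CachingPossible is hypothetical, and the rule it encodes is taken only from the statements in this post, not from any official API guarantee.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns true if, according to the rules described in
// this post, global loads can be cached in L1 on the given device.
// Assumption: Compute Capability 5.2 is special-cased because it reports
// globalL1CacheSupported == 0 even though opt-in caching is available.
inline bool globalL1CachingPossible(const cudaDeviceProp& prop)
{
    if (prop.globalL1CacheSupported) return true;
    return prop.major == 5 && prop.minor == 2;
}
```

A caller would fill a cudaDeviceProp with cudaGetDeviceProperties and pass it to the helper, for example to decide at run time which of two separately compiled kernel variants to launch.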


Final Thoughts

This post was inspired by Tony Scudiero’s packed talk at GTC in March, S5376: Memory Bandwidth Bootcamp: Beyond Best Practices. Tony explores several strategies, including the read-only cache and opt-in L1 caching, for a problematic memory access pattern similar to what you might encounter when using Array of Structures (AoS) data layouts.