Dan Cyca's blog

Tesla Meets Maxwell

Tesla M40 and Tesla M60 - A New Epoch for GPU Computing

NVIDIA Tesla M40 GPU

“One scientific epoch ended and another began with James Clerk Maxwell.” (Albert Einstein)

Opt-In L1 Caching of Global Loads on Some Kepler/Maxwell GPUs

Background

CUDA developers generally strive for coalesced global memory accesses and/or explicit ‘caching’ of global data in shared memory.  However, sometimes algorithms have memory access patterns that cannot be coalesced, and that are not a good fit for shared memory.  Fermi GPUs have an automatic L1 cache on each streaming multiprocessor (SM) that can be beneficial for these problematic global memory access patterns.  First-generation Kepler GPUs have an automatic L1 cache on each SM, but it only caches local memory accesses.  In these GPUs, the lack of automatic L1 cache for global memory is partially offset by the introduction of a separate 48 KB read-only (née texture) cache per SM.

Opt-In L1 Caching on Kepler GPUs

NVIDIA quietly re-enabled L1 caching of global memory on GPUs based on the GK110B, GK20A, and GK210 chips.  The Tesla K40 (GK110B), Tesla K80 (GK210) and Tegra K1 (GK20A) all support this feature.  You can programmatically query whether a GPU supports caching global memory in L1 by calling cudaGetDeviceProperties and examining the globalL1CacheSupported property.  Examining the Compute Capability alone is not sufficient; the Tesla K20/K20X and Tesla K40 both support Compute Capability 3.5, but only the K40 supports caching global memory in L1.
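
A minimal sketch of that query (device 0 and the printf formatting are just illustrative; the globalL1CacheSupported field requires a sufficiently recent CUDA toolkit):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaError_t err = cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
        if (err != cudaSuccess) {
            std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // Non-zero means the device can cache global loads in L1.
        std::printf("%s (Compute %d.%d): globalL1CacheSupported = %d\n",
                    prop.name, prop.major, prop.minor, prop.globalL1CacheSupported);
        return 0;
    }

Note that this property only reports hardware support; you still have to opt in at compile time (for example with nvcc's -Xptxas -dlcm=ca flag, which requests that global loads be cached in L1).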

Maximizing Shared Memory Bandwidth on NVIDIA Kepler GPUs

Shared Memory Configurations

On NVIDIA Kepler (Compute 3.x) GPUs, shared memory has 32 banks, with each bank having a bandwidth of 64 bits per clock cycle. On Fermi GPUs (Compute 2.x), shared memory also has 32 banks, but the bandwidth per bank is only 32 bits per clock cycle.

In Fermi GPUs, successive 32-bit words are assigned to successive banks. Kepler GPUs are configurable: either successive 32-bit words or successive 64-bit words are assigned to successive banks. You can specify your desired configuration by calling cudaDeviceSetSharedMemConfig() from the host prior to launching your kernel, and specifying one of cudaSharedMemBankSizeDefault, cudaSharedMemBankSizeFourByte, or cudaSharedMemBankSizeEightByte.
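
As a sketch, here is how you might request the eight byte configuration before a launch (the kernel, the 256-thread blocks, and the assumption that n is a multiple of 256 are placeholders for illustration):

    #include <cuda_runtime.h>

    // Placeholder kernel that stages doubles through shared memory.
    __global__ void scaleDoubles(double *data)
    {
        __shared__ double tile[256];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = data[idx];
        __syncthreads();
        data[idx] = 2.0 * tile[threadIdx.x];
    }

    void launchWithEightByteBanks(double *d_data, int n)   // n assumed to be a multiple of 256
    {
        // Request 64-bit shared memory banks; the setting is device-wide and
        // only takes effect on GPUs with configurable bank sizes (Compute 3.x).
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

        scaleDoubles<<<n / 256, 256>>>(d_data);
        cudaDeviceSynchronize();
    }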

In the eight byte configuration, bank conflicts occur when two or more threads in a warp request different 64-bit words from the same bank. In the four byte configuration, bank conflicts occur if two or more threads in a warp access 32-bit words from the same bank where those words span multiple 64-word aligned segments (Figure 1).

Figure 1 - Bank Conflicts in the Four Byte Configuration

Performance Implications

For kernels operating on double precision values, the eight byte configuration on Kepler allows loads/stores to consecutive doubles without bank conflicts. This is an improvement over Fermi GPUs, where accessing double precision values always incurred bank conflicts. If your kernels are operating on single precision values, you are only utilizing half the available shared memory bandwidth. Modifying your kernels to operate on 64-bit values in shared memory (perhaps using float2) may improve performance if shared memory throughput is a performance bottleneck for your kernel.
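
Here is a rough sketch of that idea, assuming your data can be reorganized into float2 pairs (the kernel name, the 256-thread blocks, and the arithmetic are placeholders):

    // Pack pairs of floats into float2 so each shared memory access moves
    // 64 bits, matching Kepler's 64-bit-per-bank bandwidth.
    __global__ void sumPairs(const float2 *in, float2 *out)
    {
        __shared__ float2 tile[256];

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[idx];      // one 64-bit shared memory store per thread
        __syncthreads();

        float2 v = tile[threadIdx.x];     // one 64-bit shared memory load per thread
        out[idx] = make_float2(v.x + v.y, v.x - v.y);
    }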

Webinar: CUDA Tools for Optimal Performance and Productivity

Presented by Dan Cyca

This webinar provides an overview of the profiling techniques and tools available to help you optimize your code. We will examine NVIDIA’s Visual Profiler and cuobjdump, and highlight the various methods available for understanding the performance of your CUDA program. The second part of the session will focus on debugging techniques and the tools available to help you identify issues in your kernels. The debugging tools provided in CUDA 5.5, including Nsight and cuda-memcheck, will be discussed.

Click here to find out more about Acceleware's CUDA training.

Constant Cache vs. Read-Only Cache - Part 1

The Checkered Flag

During a recent webinar one of our attendees asked us the question:
“So which is faster, the constant cache or the read-only data cache?”

The constant cache and the read-only cache are intended for quite different purposes, so instead of answering directly, let’s briefly consider a Ferrari 458 Speciale and a Dodge Grand Caravan. If you need to post a fast lap time at the Nürburgring, take the Ferrari. However, the Dodge is the better choice if you need to get six passengers and their luggage to a weekend getaway destination.

With that in mind, let’s take a closer look at the constant cache and the read-only data cache.

Constant Memory

Constant memory has been available to CUDA developers since the very first generation of CUDA capable GPUs (Compute Capability 1.0). Data in constant memory:

  • Resides in a 64KB partition of device memory
  • Is accessed through an 8KB cache on each SM(X)
  • Is intended to be broadcast to all threads in a warp

The last point is important. If all the threads in the warp request the same value, that value is delivered to all threads in a single cycle. If the threads in a warp request N different values, the request is serialized and the N values are delivered one at a time over N clock cycles. You want to ensure your indexes into constant memory arrays are not functions of threadIdx.x.
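
As a rough illustration (the coefficient array, its size, and the filter loop are hypothetical), here is a kernel whose constant memory index is driven by a loop counter rather than by threadIdx.x, so every read is a broadcast:

    // Host code would fill c_coeffs with cudaMemcpyToSymbol before the launch.
    __constant__ float c_coeffs[16];

    __global__ void applyFilter(const float *in, float *out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx + 16 > n) return;               // simple bounds check for this sketch

        float sum = 0.0f;
        for (int k = 0; k < 16; ++k)
            sum += c_coeffs[k] * in[idx + k];   // k is uniform across the warp: broadcast

        out[idx] = sum;
    }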

Read-Only Data Cache

The read-only data cache was introduced with Compute Capability 3.5 architectures (e.g. Tesla K20c/K20X and GeForce GTX Titan/780 GPUs). Similar functionality has been available since Compute Capability 1.0 devices, although you needed to use the somewhat unconventional texture path to take advantage of it.

Each SMX has a 48KB read-only cache. The CUDA compiler automatically accesses data via the read-only cache when it can determine that data is read-only for the lifetime of your kernel. In practice, you need to qualify pointers as const and __restrict__ before the compiler can satisfy this condition. You can also specify a read-only data cache access with the __ldg() intrinsic function.
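
For example, here is a minimal sketch of a kernel that qualifies its read-only input as const and __restrict__, and also uses __ldg() explicitly (the SAXPY-style computation is just a placeholder):

    __global__ void saxpyReadOnly(const float * __restrict__ x,
                                  float * __restrict__ y,
                                  float a, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            float xv = __ldg(&x[idx]);    // explicit read-only data cache load
            y[idx] = a * xv + y[idx];
        }
    }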
