The Checkered Flag
During a recent webinar, one of our attendees asked us:
“So which is faster, the constant cache or the read-only data cache?”
The constant cache and the read-only data cache are intended for quite different purposes. Let’s briefly consider a Ferrari 458 Speciale and a Dodge Grand Caravan instead. If you need to post a fast lap time at the Nürburgring, take the Ferrari. However, the Dodge is the better choice if you need to get six passengers and their luggage to a weekend getaway destination.
With that in mind, let’s take a closer look at the constant cache and the read-only data cache.
Constant Cache
Constant memory has been available to CUDA developers since the very first generation of CUDA-capable GPUs (Compute Capability 1.0). Data in constant memory:
- Resides in a 64KB partition of device memory
- Is accessed through an 8KB cache on each SM(X)
- Is intended to be broadcast to all threads in a warp
The last point is important. If all the threads in a warp request the same value, that value is delivered to all threads in a single cycle. If the threads in a warp request N different values, the access is serialized and the N values are delivered one at a time over N clock cycles. You therefore want to ensure that your indexes into constant memory arrays are not functions of the thread index: every thread in the warp should read the same location, as the sketch below illustrates.
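Here is a minimal sketch of the broadcast-friendly access pattern; the coeffs table and the scale kernel are hypothetical names used for illustration:

```cuda
#include <cuda_runtime.h>

// A small coefficient table: 64 floats = 256 bytes, comfortably
// inside the 64KB constant memory partition.
__constant__ float coeffs[64];

__global__ void scale(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float acc = 0.0f;
        // The loop index i is uniform across the warp, so every
        // thread reads the same coeffs[i] and each access is a
        // single-cycle broadcast rather than a serialized replay.
        for (int i = 0; i < 64; ++i)
            acc += coeffs[i];
        out[idx] = acc * in[idx];
    }
}

int main()
{
    const int n = 1 << 20;
    float h_coeffs[64];
    for (int i = 0; i < 64; ++i)
        h_coeffs[i] = 1.0f / 64.0f;

    // Constant memory is written from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

By contrast, if the kernel indexed coeffs[threadIdx.x % 64], each warp would request up to 32 different values and pay the full serialization cost on every load.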
Read-Only Data Cache
The read-only data cache was introduced with Compute Capability 3.5 architectures (e.g. Tesla K20c/K20X and GeForce GTX Titan/780 GPUs). Similar functionality has been available since Compute Capability 1.0, although you needed to go through the somewhat unconventional texture path to take advantage of it.
Each SMX has a 48KB read-only cache. The CUDA compiler automatically routes loads through the read-only cache when it can determine that the data is read-only for the lifetime of your kernel. In practice, you need to qualify pointers with both const and __restrict__ before the compiler can prove this condition holds. You can also request a read-only data cache load explicitly with the __ldg() intrinsic function.
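A minimal sketch of both approaches, assuming a Compute Capability 3.5 or newer device (compile with nvcc -arch=sm_35; the addVectors kernel is a hypothetical example):

```cuda
__global__ void addVectors(const float* __restrict__ a,
                           const float* __restrict__ b,
                           float* __restrict__ c,
                           int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // a[idx]: const plus __restrict__ lets the compiler prove the
        // data is read-only and unaliased, so it is free to route this
        // load through the 48KB read-only cache automatically.
        // __ldg(&b[idx]): requests a read-only cache load explicitly,
        // regardless of the compiler's own analysis.
        c[idx] = a[idx] + __ldg(&b[idx]);
    }
}
```

In most cases the const __restrict__ qualification is enough; __ldg() is useful when aliasing makes it hard for the compiler to prove read-only access on its own.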