Webinar: How to Improve Performance using the CUDA Memory Model and Features of the Kepler Architecture

Presented by Chris Mason

Join Chris Mason, Acceleware Product Manager, as he explores the memory model of the GPU, the memory enhancements available in the new Kepler architecture, and how they affect your performance optimizations. The webinar begins with an essential overview of GPU architecture and thread cooperation before focusing on the different memory types available on the GPU. We define shared, constant, and global memory and discuss the best locations to store your application data for optimized performance. The shuffle instruction, the new shared memory configurations, and the Read-Only Data Cache of the Kepler architecture are introduced, and optimization techniques for each are discussed.

Click here to find out more about Acceleware's CUDA training.

Webinar: An Introduction to GPU Programming

Presented by Chris Mason

This is an informative introduction to GPU computing, recorded from our live webinar in the GTC Express series on September 25th, 2013. The tutorial begins with a brief overview of CUDA and data parallelism before focusing on the GPU programming model. We explore the fundamentals of GPU kernels, host and device responsibilities, CUDA syntax, and thread hierarchy. Enjoy!


Constant Cache vs. Read-Only Cache - Part 1

The Checkered Flag

During a recent webinar one of our attendees asked us the question:
“So which is faster, the constant cache or the read-only data cache?”

The constant cache and the read-only cache are intended for quite different purposes. Let’s briefly consider a Ferrari 458 Speciale and a Dodge Grand Caravan instead. If you need to post a fast lap time at the Nürburgring, take the Ferrari. However, the Dodge is a better choice if you need to get six passengers and their luggage to a weekend getaway destination.

With that in mind, let’s take a closer look at the constant cache and the read-only data cache.

Constant Memory

Constant memory has been available to CUDA developers since the very first generation of CUDA capable GPUs (Compute Capability 1.0). Data in constant memory:

  • Resides in a 64KB partition of device memory
  • Is accessed through an 8KB cache on each SM(X)
  • Is intended to be broadcast to all threads in a warp

The last point is important. If all the threads in a warp request the same value, that value is delivered to all threads in a single cycle. If the threads in a warp request N different values, the request is serialized and the N values are delivered one at a time over N clock cycles. You therefore want to ensure that your indices into constant memory arrays are not functions of threadIdx.x.
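To make the indexing point concrete, here is a minimal sketch (the kernel and array names are ours, not from the webinar):

```cuda
#include <cuda_runtime.h>

// All __constant__ data on the device shares the 64KB partition.
// These 16 filter coefficients are the same for every thread.
__constant__ float coeffs[16];

__global__ void applyFilter(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = 0.0f;
    // Good: k is identical across the warp, so each coeffs[k] read is
    // broadcast to all 32 threads in a single cycle.
    for (int k = 0; k < 16; ++k)
        sum += coeffs[k] * in[i + k];  // assumes in[] is padded with 16 extra elements

    // Bad: coeffs[threadIdx.x % 16] would request 16 different values
    // per warp, serializing the access over 16 cycles.
    out[i] = sum;
}
```

The host fills the array before launch with cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs)).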

Read-Only Data Cache

The read-only data cache was introduced with Compute Capability 3.5 architectures (e.g. Tesla K20c/K20X and GeForce GTX Titan/780 GPUs). Similar functionality has been available since Compute Capability 1.0 devices, although you needed to use the somewhat unconventional texture path to take advantage of it.

Each SMX has a 48KB read-only cache. The CUDA compiler automatically accesses data via the read-only cache when it can determine that data is read-only for the lifetime of your kernel. In practice, you need to qualify pointers as const and __restrict__ before the compiler can satisfy this condition. You can also specify a read-only data cache access with the __ldg() intrinsic function.
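As a minimal sketch (the kernel name is ours, not from the post), both the qualifier-based and explicit forms look like this:

```cuda
#include <cuda_runtime.h>

// const and __restrict__ together tell the compiler that neither pointer
// aliases the other and that in[] is never written during the kernel,
// so loads of in[i] can be routed through the 48KB read-only cache.
__global__ void scale(float *__restrict__ out,
                      const float *__restrict__ in,
                      float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = factor * in[i];

        // Equivalent explicit form on Compute Capability 3.5+:
        //     out[i] = factor * __ldg(&in[i]);
    }
}
```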

NVIDIA Hardware Comparison – GeForce GTX 690 vs. Tesla K10

GeForce GTX 690 vs. Tesla K10

During a recent webinar one of our attendees asked us the question:

“Which hardware is faster – the GeForce GTX 690 or the Tesla K10?”

The short answer: GeForce GTX 690.

That isn’t the entire story though. The specifications for both cards are shown in the table below:

Card     | Number of GPUs | Memory BW (GB/s) | SP Performance (GFlops) | Memory (GB) | Power (W) | Cooling
GTX 690  | 2 (GK104)      | 384.5            | 5622                    | 4           | 300       | Fan
K10      | 2 (GK104)      | 320              | 4577                    | 8           | 250       | Passive

While the raw throughput of the GeForce GTX 690 is higher than that of the Tesla K10, the memory on the GTX 690 is half the size of the K10's. Depending on your application, memory size may outweigh the benefit of the slightly higher throughput on the GeForce.

Additionally, NVIDIA’s Tesla product line has features not available for GeForce hardware:

  • ECC Memory Protection – improved memory reliability
  • Manufacturing Quality – hardware is manufactured and tested by NVIDIA to guarantee the highest reliability
  • 1.5 to 2 Year Availability – useful for enterprise customers who need to replace or add hardware after initial deployment
  • Cluster and GPU Management Software – diagnostics useful for monitoring hardware in clusters
  • Advanced Driver Support – TCC drivers allow the use of Windows Remote Desktop and avoid the Windows watchdog timer when running large kernels
  • Form Factors – Tesla is supported in certified servers, blades and workstations

The full list of Tesla benefits can be found on NVIDIA's Tesla product pages.

In almost all cases, our customers deploy Tesla hardware in production environments. The memory size, reliability and long availability outweigh the modest performance benefits of the GeForce product line.

Finally, for double precision computations you will want to consider the K20X or GeForce GTX Titan since the double precision performance on the GTX 690 and K10 is approximately 1/24th the single precision throughput. But if you can get away with single precision calculations, the GeForce GTX 690 and Tesla K10 are great hardware choices!


Does your CUDA application need to target a specific GPU? If you are writing GPU-enabled code, you would typically use a device query to select the desired GPUs. However, a quick and easy solution for testing is the environment variable CUDA_VISIBLE_DEVICES, which restricts the devices that your CUDA application sees. This can be useful if you are sharing resources on a node or you want your GPU-enabled executable to target a specific GPU.
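For example (my_cuda_app below is a hypothetical executable; the variable itself is standard CUDA runtime behavior):

```shell
# Make only device 1 visible; inside the application it is
# renumbered and appears as device 0.
export CUDA_VISIBLE_DEVICES=1
echo "Visible devices: $CUDA_VISIBLE_DEVICES"

# Several devices can be listed, and the order defines the renumbering:
#   CUDA_VISIBLE_DEVICES=2,0 ./my_cuda_app
```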

GTC, San Jose 2012

This year at NVIDIA’s GPU Technology Conference in San Jose, I had the privilege of hosting a CUDA™ workshop for over 200 attendees. Throughout the day we covered topics including GPU memory management, debugging, and optimizations. The questions from the audience were very insightful and addressed important aspects of GPU programming. I would like to thank Kiran and Ramesh, who were able to help me on stage with the code demonstrations. The feedback from attendees was very positive and I look forward to hosting additional seminars at upcoming conferences.

Cell Phones in the News

SAR, Cell Phones, and San Francisco

An interesting article came across my desk this week. It discusses a recent vote by the Board of Supervisors in San Francisco requiring that cell phone retailers post the specific absorption rate (SAR) for all cellular handsets they sell. The article caught my attention because many cell phone manufacturers use our FDTD libraries to model the electromagnetic field strengths, which are used to calculate SAR as a post-processing step.

Design Automation Conference 2009

I had the opportunity to visit DAC 2009 this year in San Francisco. DAC is an EDA conference which showcases tools and processes that perform simulation, logic synthesis, design verification, and timing analysis. The work we do in linear algebra has direct applicability in this space, and we are eager to expand our reach in the EDA market. The show itself was well attended and there were approximately 200 exhibitors. It was great to catch up with our customers and reach out to some new ones.

IMS 2009 Wrap Up

IMS/MTT 2009 in Boston was an extremely successful show for Acceleware.

Acceleware at IMS 2009 in Boston

For the fourth consecutive year, Acceleware will be exhibiting at the International Microwave Symposium (IMS), held in Boston from June 9th through June 11th. I’m excited to see many familiar faces and to meet many new ones. We will be showcasing our products, including our accelerated FDTD software, clustered solutions, and our linear algebra solvers.
