Blogs

Acceleware Tutorial at GTC 2013: CUDA Tools for Optimal Performance and Productivity (S3455)

Presented by Kelly Goss

Get the lowdown on debugging and profiling your GPU program! This tutorial dives deep into profiling techniques and the tools available to help you optimize your code. We demonstrate NVIDIA's Visual Profiler, nvcc flags and cuobjdump, and highlight the various methods available for understanding the performance of your CUDA program. The second part of the session focuses on debugging techniques and the tools available to help you identify issues in your kernels. The latest debugging tools provided in CUDA 5.0, including Nsight and cuda-memcheck, are discussed. A programming demo of the Visual Profiler and Nsight is provided.

Watch Kelly’s other GTC tutorials:

Acceleware Tutorial at GTC 2013: Essential Optimization Techniques for NVIDIA Kepler and Fermi Architecture (S3454)

Presented by Kelly Goss

Learn how to optimize your algorithms for the Fermi and Kepler architectures. This informative tutorial covers the key optimization strategies for compute-bound and memory-bound problems. The session includes techniques for ensuring peak utilization of CUDA cores by choosing the optimal block size and using dynamic parallelism on the Kepler architecture. For compute-bound algorithms we discuss how to improve branching efficiency, use intrinsic functions and unroll loops. For memory-bound algorithms, optimal access patterns for global and shared memory are presented and the differences between the Fermi and Kepler architectures are highlighted. This session includes code examples throughout and a programming demonstration highlighting the optimal global memory access pattern, which is applicable to all GPU architectures.

Acceleware Tutorial at GTC 2013: How to Improve Performance using the CUDA Memory Model and Features of the new Kepler Architecture (S3453)

Presented by Kelly Goss

Explore the memory model of the GPU, the memory enhancements available in the new Kepler architecture, and how these affect performance. The tutorial begins with an essential overview of GPU architecture and thread cooperation before focusing on the different memory types available on the GPU. We define shared, constant and global memory and discuss the best locations to store your application data for optimized performance. The shuffle instruction, new shared memory configurations and Read-Only Cache of the Kepler architecture are introduced and optimization techniques discussed. A programming demonstration of shared and constant memory is provided. The demonstration code is then rewritten using the shuffle instruction for the Kepler architecture.

Acceleware Tutorial at GTC 2013: An Introduction to GPU Programming (S3452)

Presented by Kelly Goss

Join us for an informative introduction to GPU programming. The tutorial begins with a brief overview of CUDA and data-parallelism before focusing on the GPU programming model. We explore the fundamentals of GPU kernels, host and device responsibilities, CUDA syntax and thread hierarchy. A programming demonstration of two simple CUDA kernels is provided.

Essential Optimization Techniques for NVIDIA Kepler and Fermi Architecture

Learn how to optimize your algorithms for the Fermi and Kepler architectures. This 60-minute webinar focuses on key optimization strategies for compute-bound and memory-bound problems. The session includes techniques for ensuring peak utilization of CUDA cores by choosing the optimal block size and using dynamic parallelism on the Kepler architecture. For compute-bound algorithms I discuss how to improve branching efficiency, loop unrolling, instruction-level parallelism and dynamic parallelism. For memory-bound algorithms, I present optimal access patterns for global and shared memory, highlighting the differences between the Fermi and Kepler architectures.

This webinar was presented live as part of NVIDIA’s GTC Express Program, which is a great learning resource. Many thanks to Donal and his team at NVIDIA who made this webinar possible.

Kepler’s Shuffle Instruction

Understanding the CUDA memory model and utilizing it effectively is often key to achieving high performance from your NVIDIA GPU. The shuffle instruction, available on Kepler devices (compute capability 3.0 and newer), is a new tool that programmers can add to their bag of tricks to further optimize memory performance.

Figure 1 – CUDA Memory Model

The CUDA memory model, illustrated in Figure 1, consists of several different memory regions that are on and off chip, with varying scopes, latencies and bandwidths. The two fastest memory regions (lowest latency and highest bandwidth) are registers and shared memory. Registers are allocated by the compiler for each thread and therefore have the scope of a thread. Shared memory is explicitly defined by the programmer, allocated per block and has thread block scope. Since shared memory is visible to every thread in a block it is commonly used as a programmer controlled cache.

With the Kepler shuffle instruction we now have another way, in addition to shared memory, to share values between threads. So why would you want to use the shuffle instruction instead of shared memory? First, you can use the shuffle instruction to free up shared memory for other data or to increase your occupancy. Second, the shuffle instruction is faster than shared memory, since it requires only one instruction versus three for shared memory (write, synchronize, read). Third, relative to Fermi, shared memory bandwidth has doubled on Kepler devices but the number of compute cores has increased by 6x, so shared memory bandwidth per core has dropped; the shuffle instruction provides another low-latency, high-bandwidth means to share data between threads and keep the CUDA cores busy. Finally, you might want to use the shuffle instruction in place of warp-synchronous optimizations (code that removes __syncthreads() and relies on the threads of a warp executing in lockstep).
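To make the "one instruction versus three" comparison concrete, here is a minimal sketch in which every thread of a single warp reads the value held by its right-hand neighbour, first through shared memory and then through a shuffle. The kernel and helper names are made up for illustration. The sketch assumes a CUDA 9 or later toolkit, where the shuffle intrinsic is spelled `__shfl_sync` and takes a mask of participating lanes as its first argument; the shuffle call itself is described in the paragraph that follows.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Shared-memory version: write, synchronize, read.
// Assumes a launch with exactly one warp (32 threads) per block.
__global__ void neighbourShared(const int *in, int *out)
{
    __shared__ int tile[32];
    int lane = threadIdx.x;
    tile[lane] = in[lane];                // 1. write to shared memory
    __syncthreads();                      // 2. synchronize the block
    out[lane] = tile[(lane + 1) % 32];    // 3. read the neighbour's value
}

// Shuffle version: one instruction reads the neighbour's register directly.
__global__ void neighbourShuffle(const int *in, int *out)
{
    int lane = threadIdx.x;
    int value = in[lane];
    out[lane] = __shfl_sync(0xffffffffu, value, (lane + 1) % 32);
}

// Hypothetical helper: runs both kernels and checks that they agree with
// each other and with the expected rotation of the input.
bool neighbourKernelsAgree()
{
    const int n = 32;
    int hIn[n], hA[n], hB[n];
    for (int i = 0; i < n; ++i) { hIn[i] = i * i; hA[i] = -1; hB[i] = -2; }

    int *dIn = nullptr, *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    neighbourShared<<<1, n>>>(dIn, dA);
    neighbourShuffle<<<1, n>>>(dIn, dB);

    cudaMemcpy(hA, dA, sizeof(hA), cudaMemcpyDeviceToHost);
    cudaMemcpy(hB, dB, sizeof(hB), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dA); cudaFree(dB);

    bool same = true;
    for (int i = 0; i < n; ++i)
        if (hA[i] != hB[i] || hA[i] != hIn[(i + 1) % n]) same = false;
    return same;
}

int main()
{
    printf("outputs match: %s\n", neighbourKernelsAgree() ? "yes" : "no");
    return 0;
}
```

The shuffle kernel needs no `__shared__` array at all, which is where the shared memory and occupancy savings mentioned above come from.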

The shuffle instruction allows a thread to read a value stored in a register of another thread within the same warp. A warp is a group of 32 threads with consecutive thread index values. The general shuffle instruction, __shfl, returns the value stored in a register of any thread in the warp. The source thread is identified by its lane index (laneID), which is the index of a thread within its warp and is calculated as thread index % 32 (threadIdx.x % 32 for a one-dimensional block).

float __shfl(
    float var,             // Variable you want to read from the source thread
    int srcLane,           // laneID of the source thread
    int width = warpSize   // Division of warp into segments of size width
);
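As an illustration of how the general shuffle might be used, the sketch below sums the lane IDs of one warp using only registers, then broadcasts the total from lane 0 to every lane. It is a minimal example under stated assumptions, not code from the original post: the kernel and helper names are hypothetical, and it uses the `__shfl_sync` and `__shfl_down_sync` spellings provided by CUDA 9 and later, which take a lane mask as an extra first argument and otherwise follow the parameter order shown above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread of a single warp contributes its lane ID. The warp sums the
// values with shuffles only: no shared memory and no __syncthreads().
__global__ void warpSum(float *out)
{
    const unsigned fullMask = 0xffffffffu;   // all 32 lanes participate
    int laneId = threadIdx.x % warpSize;     // lane index within the warp
    float value = static_cast<float>(laneId);

    // Tree reduction: at each step a lane adds the value held by the lane
    // `offset` positions above it. Lane 0 ends up with the full sum.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        value += __shfl_down_sync(fullMask, value, offset);

    // General shuffle: every lane reads the register of source lane 0.
    float total = __shfl_sync(fullMask, value, 0);

    // Written by the last lane to show that the broadcast reached it.
    if (laneId == warpSize - 1)
        *out = total;
}

// Hypothetical helper: launches one warp and returns the result,
// or -1.0f if the kernel result could not be copied back.
float runWarpSum()
{
    float *dOut = nullptr;
    float hOut = -1.0f;
    if (cudaMalloc(&dOut, sizeof(float)) != cudaSuccess)
        return hOut;
    warpSum<<<1, 32>>>(dOut);
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return hOut;
}

int main()
{
    // On a working device the sum is 0 + 1 + ... + 31 = 496.
    printf("warp sum = %.1f\n", runWarpSum());
    return 0;
}
```

Passing a `width` smaller than `warpSize` (it must be a power of two) would split the warp into independent segments, each reduced and broadcast on its own.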

Acceleware at Rice Oil & Gas HPC Workshop in Houston

Rice University in Houston, Texas, was the venue of this year’s Oil and Gas High Performance Computing Workshop, held on February 28. A record number of attendees met to discuss, share and learn about the latest technology advancements and future industry challenges. Dirk Smit, Vice President Exploration Technology and Chief Scientist Geophysics at Shell, started off the workshop with an interesting keynote on the compute challenges of future energy demand, which paved the way for a variety of other speakers and presentations.

Scott Quiring, Software Developer and Seismic Team Lead at Acceleware, joined industry and academic leaders to present a case study on ‘Reverse Time Migration: A Study on GPU/CPU Hybrid Computing’ as part of the Accelerator Enabled Computing parallel session.

Scott’s presentation demonstrated why RTM implementations need to be continuously rebalanced as hardware performance and software features evolve, in order to avoid performance bottlenecks and maintain optimum throughput. Other key points of his presentation included:

  • The basic methodology of RTM
  • RTM computational challenges including efficient propagation of a 3D seismic wavefield and accessing the wavefield in reverse-time order
  • How to maximize performance on hybrid (GPU/CPU) platforms

A recording of Scott’s presentation can be viewed here: mediahub.rice.edu

CUDA_VISIBLE_DEVICES – Masking GPUs

Do you need to target a specific GPU within your CUDA application? A quick and easy solution is to use the environment variable CUDA_VISIBLE_DEVICES to restrict the devices that your CUDA application sees. This is useful when you are sharing the GPUs on a node between jobs or want to pin an application to a particular GPU.
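A small device-listing program is a convenient way to see the effect of the mask. The sketch below prints the value of CUDA_VISIBLE_DEVICES and then enumerates the devices the CUDA runtime reports; the helper name and the binary name used in the note afterwards are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of devices the runtime can see,
// or 0 if the query fails (for example, when no driver is installed).
int visibleDeviceCount()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return 0;
    return count;
}

int main()
{
    // Report the mask exactly as the process inherited it.
    const char *mask = std::getenv("CUDA_VISIBLE_DEVICES");
    printf("CUDA_VISIBLE_DEVICES = %s\n", mask ? mask : "(not set)");

    int count = visibleDeviceCount();
    printf("visible devices: %d\n", count);

    // The index used here is the index *after* masking, starting at 0.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
            printf("  device %d: %s (PCI bus %d)\n", i, prop.name, prop.pciBusID);
    }
    return 0;
}
```

If the program is built as `list_devices`, running it on a multi-GPU node as `CUDA_VISIBLE_DEVICES=1 ./list_devices` should list a single device. The runtime renumbers the visible devices from 0, so the application refers to that GPU as device 0; the PCI bus ID is printed to help tell which physical card it is.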

Vegas, baby, Vegas!

The Society of Exploration Geophysicists (SEG) headed back to Las Vegas for its International Exposition and 82nd Annual Meeting. An impressive 8,412 delegates attended the event, making the show the world's largest oil, energy and mineral exposition.

New in the Acceleware booth for 2012 were dynamic talks on the latest research developments for FWI, 3D elastic modeling, RTM and hardware acceleration advancements. Acceleware’s RTM Product Manager, Darren Foltinek, and Seismic Researcher Team Lead, Dr. Ray McGarry, teamed up with Ty McKercher, Principal Solution Architect from NVIDIA, to deliver the presentations:

  • An Introduction to Full Waveform Inversion
  • AxRTM: An Innovative Reverse Time Migration Library
  • Beyond RTM, GPU Integration in the E&P Workflow
  • Evaluating Hardware Acceleration for your Workflow
  • Making 3D Elastic Modeling a Reality
  • Dispersion Reducing Finite Difference Coefficients for CSEM

If you missed the presentations or would like a copy of the slides, drop us an email at services@acceleware.com.

An Introduction to FWI

SuperComputing 2012 Wrap up

We set off to Salt Lake City for SuperComputing 2012 (SC12), trying to escape the snow and cold in Calgary only to land in a full blizzard at SLC airport. The snowfall didn't last very long though and we experienced the city in a beautiful mix of fall, winter and blue skies.

Acceleware - Salt Lake City by Night
