Blogs

Webinar: Essential CUDA Optimization Techniques

Join Chris Mason, Product Manager at Acceleware, and learn how to optimize your algorithms for NVIDIA GPUs. This informative webinar provides an overview of the improved performance analysis tools available in CUDA 6.0 and key optimization strategies for compute-, latency- and memory-bound problems. The webinar includes techniques for ensuring peak utilization of CUDA cores by choosing the optimal block size. For compute-bound algorithms, Chris discusses improving branching efficiency, using intrinsic functions and unrolling loops. For memory-bound algorithms, optimal access patterns for global and shared memory are presented, including a comparison between the Fermi and Kepler architectures.
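To give a concrete flavour of the block-size point, here is a minimal, hypothetical CUDA sketch of a launch configuration; the kernel, the 256-thread block size and the grid-size arithmetic are illustrative choices, not material from the webinar.

    // Illustrative kernel: a grid-stride loop so consecutive threads touch
    // consecutive elements (coalesced global memory accesses).
    __global__ void scaleKernel(float *data, float factor, int n)
    {
        for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n;
             idx += gridDim.x * blockDim.x)
        {
            data[idx] *= factor;
        }
    }

    // Host-side launch: pick a block size that is a multiple of the 32-thread
    // warp size and round the grid size up so every element is covered.
    void launchScale(float *d_data, float factor, int n)
    {
        int blockSize = 256;                             // a common starting point; tune per kernel
        int gridSize  = (n + blockSize - 1) / blockSize; // ceiling division
        scaleKernel<<<gridSize, blockSize>>>(d_data, factor, n);
    }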

Webinar Recording: An Introduction to OpenCL using AMD GPUs

Join Chris Mason, Product Manager at Acceleware, for an informative introduction to GPU Programming. The tutorial begins with a brief overview of OpenCL and data-parallelism before focusing on the GPU programming model. We also explore the fundamentals of GPU kernels, host and device responsibilities, OpenCL syntax and work-item hierarchy.

Webinar Recording: Asynchronous Operations & Dynamic Parallelism in CUDA

Join Chris Mason, Product Manager at Acceleware, as he leads attendees in a deep dive into asynchronous operations and how to maximize throughput on both the CPU and GPU with streams. Chris demonstrates how to build a CPU/GPU pipeline and how to design your algorithm to take advantage of asynchronous operations. The second part of the webinar focuses on dynamic parallelism.
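As a rough illustration of the stream-based CPU/GPU pipeline the webinar describes, the sketch below overlaps host-to-device copies, kernel execution and device-to-host copies across independent chunks. It is only a sketch under simplifying assumptions: the process() kernel and chunk count are hypothetical, and the host buffers are assumed to be pinned (e.g. allocated with cudaMallocHost) so the copies are truly asynchronous.

    #include <cuda_runtime.h>

    // Hypothetical kernel standing in for the real per-chunk work.
    __global__ void process(float *data, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            data[idx] *= 2.0f;                  // placeholder computation
    }

    void pipeline(float *h_in, float *h_out, int n, int chunks)
    {
        int chunkSize = n / chunks;             // assume n divides evenly for brevity
        size_t bytes  = chunkSize * sizeof(float);

        float *d_buf;
        cudaMalloc(&d_buf, n * sizeof(float));

        // One stream per chunk so work from different chunks can overlap.
        cudaStream_t *streams = new cudaStream_t[chunks];
        for (int i = 0; i < chunks; ++i)
            cudaStreamCreate(&streams[i]);

        for (int i = 0; i < chunks; ++i)
        {
            int offset = i * chunkSize;

            cudaMemcpyAsync(d_buf + offset, h_in + offset, bytes,
                            cudaMemcpyHostToDevice, streams[i]);

            process<<<(chunkSize + 255) / 256, 256, 0, streams[i]>>>(d_buf + offset, chunkSize);

            cudaMemcpyAsync(h_out + offset, d_buf + offset, bytes,
                            cudaMemcpyDeviceToHost, streams[i]);
        }

        cudaDeviceSynchronize();                // wait for the whole pipeline to drain

        for (int i = 0; i < chunks; ++i)
            cudaStreamDestroy(streams[i]);
        delete[] streams;
        cudaFree(d_buf);
    }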

Technical Paper: Modeling of Electromagnetic Assisted Oil Recovery

Presented at ICEAA - IEEE APWC 2014, this paper features an algorithm for the rigorous analysis of electromagnetic (EM) heating of heavy oil reservoirs. The algorithm combines an FDTD-based EM solver with a reservoir simulator. The paper addresses some of the challenges of integrating advanced electromagnetic codes with reservoir simulators and the numerical technology that had to be developed, particularly the multi-physics coupling, the translation of meshes, and the petro-physical and EM material parameters. The challenges of calculating the electromagnetic dissipation in the vicinity of the antennas are also discussed. Example scenarios are presented and discussed.

Webinar Recording: GPU Architecture & the CUDA Memory Model

Join Chris Mason, Product Manager at Acceleware, and explore the memory model of the GPU! The webinar will begin with an essential overview of the GPU architecture and thread cooperation before focusing on the different memory types available on the GPU. Chris will define shared, constant and global memory and discuss the best locations to store your application data for optimized performance. Features available in the Kepler architecture such as shared memory configurations and Read-Only Data Cache are introduced and optimization techniques discussed.
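As a quick reminder of where those memory spaces appear in code, here is a minimal, hypothetical CUDA fragment; the names, sizes and the trivial computation are illustrative only.

    // Constant memory: a small read-only table, broadcast efficiently to all threads.
    __constant__ float c_coeffs[16];

    // g_in and g_out point to global memory; the tile lives in per-block shared memory.
    // Assumed to be launched with 256-thread blocks to match the tile size.
    __global__ void blendKernel(const float *g_in, float *g_out, int n)
    {
        __shared__ float tile[256];

        int gidx = blockIdx.x * blockDim.x + threadIdx.x;
        if (gidx < n)
            tile[threadIdx.x] = g_in[gidx];                  // global -> shared
        __syncthreads();                                     // make the tile visible block-wide

        if (gidx < n)
            g_out[gidx] = tile[threadIdx.x] * c_coeffs[0];   // shared + constant -> global
    }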

State of GPU Virtualization for CUDA Applications 2014

Introduction

Widespread corporate adoption of virtualization technologies has led some users to rely on Virtual Machines (VMs). When these users or IT administrators wish to start using CUDA, often the first thought is to spin up a new VM. Success is not guaranteed, as not all virtualization technologies support CUDA. A survey of GPU virtualization technologies for running CUDA applications is presented. To support CUDA, a VM must be able to present a supported CUDA device to the VM’s operating system and allow the NVIDIA graphics driver to be installed.

GPU Virtualization Terms 

  • Device Pass-Through: This is the simplest virtualization model, where the entire GPU is presented to the VM as if it were directly connected. The GPU is usable by only one VM. The CPU equivalent is assigning a single core for exclusive use by a VM. VMware calls this mode virtual Direct Graphics Accelerator (vDGA).
  • Partitioning: A GPU is split into virtual GPUs, each used independently by a VM.
  • Timesharing: Timesharing involves sharing the GPU, or a portion of it, between multiple VMs. Also known as oversubscription or multiplexing, the technology for timesharing CPUs is mature, while GPU timesharing is only now being introduced.
  • Live Migration: The ability to move a running VM from one VM host to another without downtime.

Virtualization Support for CUDA 

CUDA support was examined for five virtualization technology vendors that account for most of the virtualization market: VMware, Microsoft, Oracle, Citrix and Red Hat. A summary is shown in the table below.

Webinar: An Introduction to CUDA Programming

NVIDIA GTC Express webinar recording from May 28, 2014.

Join Chris Mason, Product Manager at Acceleware, and explore the memory model of the GPU! The webinar begins with an essential overview of the GPU architecture and thread cooperation before focusing on the different memory types available on the GPU. Chris defines shared, constant and global memory and discusses the best locations to store your application data for optimized performance. Features available in the Kepler architecture such as shared memory configurations and Read-Only Data Cache are introduced and optimization techniques discussed.

Webinar: Accelerating FWI via OpenCL on AMD GPUs

Join Chris Mason, Acceleware Product Manager, as he presents a case study of accelerating a seismic algorithm on a cluster of AMD GPU compute nodes for geophysical software provider and processor GeoTomo. The presentation begins with an outline of the full waveform inversion (FWI) algorithm, followed by an introduction to OpenCL. The OpenCL programming model and memory spaces are introduced. After a short programming example, Chris takes you step by step through the project phases of profiling, feasibility analysis and implementation. Chris shares the strategy for formulating the problem to take advantage of the massively parallel GPU architecture. Key optimization techniques are discussed, including coalescing and an iterative approach to handle the slices. Performance results for the GPU are compared to the CPU run times.

GPU Boost on NVIDIA’s Tesla K40 GPUs

What is GPU Boost?

GPU Boost is a new user-controllable feature for changing the processor clock speed on the Tesla K40 GPU. NVIDIA currently supports four selectable Stream Processor clock speeds and two selectable Memory Clock speeds on the K40. The base clock for the Stream Processors is 745 MHz and the three other selectable options are 666 MHz, 810 MHz and 875 MHz (finally, a manufacturer not afraid of superstition!). The two Memory Clock frequencies are 3004 MHz (default) and 324 MHz (idle). Only the effects of tuning the Stream Processor clock are discussed here, as adjusting the Memory Clock yields no application performance increase. This blog shows the impact of GPU Boost on a seismic imaging application (Reverse Time Migration) and an electromagnetic solver (finite-difference time-domain).

GPU Boost is useful as not all applications have the same power profile. The K40 has a maximum 235W power capacity. For example, an application that runs at an average power consumption of 180W at the base frequency will have a 55W power headroom. By increasing the clock frequency, the application theoretically can take advantage of the full 235W capacity.

Enabling GPU Boost

GPU Boost is controlled using NVIDIA’s System Management Interface utility (nvidia-smi) with the following commands:

Command                                      Explanation
nvidia-smi -q -d SUPPORTED_CLOCKS            Show the supported clock frequencies
nvidia-smi -ac <MEM clock,Graphics clock>    Set the memory and graphics clock frequencies
nvidia-smi -q -d CLOCK                       Show the current clock frequencies
nvidia-smi -rac                              Reset all clocks to their defaults
nvidia-smi -acp 0                            Allow non-root users to change the clocks
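For example, to run an application at the highest boost clock quoted above and then restore the defaults, the sequence might look like the following; the 3004,875 pair comes from the K40 clock options listed earlier and is illustrative only.

    nvidia-smi -q -d SUPPORTED_CLOCKS     # list the memory/graphics clock pairs the GPU supports
    nvidia-smi -acp 0                     # (as root) allow non-root users to change application clocks
    nvidia-smi -ac 3004,875               # select the 3004 MHz memory / 875 MHz graphics pair
    nvidia-smi -q -d CLOCK                # confirm the new application clocks
    nvidia-smi -rac                       # return to the default clocks when finished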

Maximizing Shared Memory Bandwidth on NVIDIA Kepler GPUs

Shared Memory Configurations

On NVIDIA Kepler (Compute 3.x) GPUs, shared memory has 32 banks, with each bank having a bandwidth of 64 bits per clock cycle. On Fermi GPUs (Compute 2.x) shared memory also has 32 banks, but the bandwidth per bank is only 32 bits per clock cycle.

In Fermi GPUs, successive 32-bit words are assigned to successive banks. Kepler GPUs are configurable: either successive 32-bit words or successive 64-bit words are assigned to successive banks. You can specify your desired configuration by calling cudaDeviceSetSharedMemConfig() from the host prior to launching your kernel, passing one of cudaSharedMemBankSizeDefault, cudaSharedMemBankSizeFourByte, or cudaSharedMemBankSizeEightByte.
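For instance, a kernel that works mainly on 64-bit values might select the eight byte configuration before it is launched; a minimal sketch, with a hypothetical kernel name and launch configuration:

    #include <cuda_runtime.h>

    __global__ void doubleKernel(double *data, int n);   // hypothetical kernel, defined elsewhere

    void launchWithEightByteBanks(double *d_data, int n)
    {
        // Map successive 64-bit words to successive shared memory banks (Kepler only).
        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

        doubleKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }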

In the eight byte configuration, bank conflicts occur when two or more threads in a warp request different 64-bit words from the same bank. In the four byte configuration, bank conflicts occur if two or more threads in a warp access 32-bit words from the same bank where those words span multiple 64-word aligned segments (Figure 1).

Figure 1 - Bank Conflicts in the Four Byte Configuration

Performance Implications

For kernels operating on double precision values, the eight byte configuration on Kepler allows loads/stores to consecutive doubles without bank conflicts. This is an improvement over Fermi GPUs, where accessing double precision values always incurred bank conflicts. If your kernels are operating on single precision values, you are only utilizing half the available shared memory bandwidth. Modifying your kernels to operate on 64-bit values in shared memory (perhaps using float2) may improve performance if shared memory throughput is a performance bottleneck for your kernel.
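A hedged sketch of that idea: packing pairs of floats into float2 elements so that each shared memory access moves 64 bits. The tile size, kernel shape and placeholder computation are illustrative only.

    // Assumed to be launched with 256-thread blocks; n2 is the number of float2 elements.
    __global__ void float2TileKernel(const float2 *g_in, float2 *g_out, int n2)
    {
        // Each element is 64 bits wide, matching Kepler's 64-bit shared memory banks.
        __shared__ float2 tile[256];

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n2)
        {
            tile[threadIdx.x] = g_in[idx];   // one 64-bit load/store per thread
            float2 v = tile[threadIdx.x];
            v.x *= 2.0f;                     // placeholder computation
            v.y *= 2.0f;
            g_out[idx] = v;
        }
    }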
