Blogs

Timeout Detection in the Windows Display Driver Model when Running CUDA Kernels: Symptoms, Solutions, and Registry Modifications

A Simple Trick To Pass Constant Arguments Into GPU Kernels

Most CUDA developers are familiar with the usual methods of passing constant arguments into GPU kernels: the simplest is directly via kernel parameters, and the other common option is copying to constant memory. Under certain circumstances, though, there's another, lesser-known way to get constants into your GPU kernel that may even improve kernel performance!
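
As a refresher before the trick itself, here is a minimal sketch of the two well-known methods; the kernel names and values are hypothetical, for illustration only:

#include <cuda_runtime.h>

// Method 1: pass the constant by value as an ordinary kernel parameter.
__global__ void scaleByParam(float* data, int n, float coeff)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= coeff;
}

// Method 2: place the constant in __constant__ memory.
__constant__ float c_coeff;

__global__ void scaleByConstant(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= c_coeff;
}

int main()
{
    const int n = 1024;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Method 1: the value travels with the launch.
    scaleByParam<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);

    // Method 2: copy the value to the __constant__ symbol first.
    float coeff = 2.0f;
    cudaMemcpyToSymbol(c_coeff, &coeff, sizeof(float));
    scaleByConstant<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaFree(d_data);
    return 0;
}

Kernel parameters are the natural fit for values that change every launch, while __constant__ memory is broadcast to all threads through a dedicated cache and suits values reused across many launches.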

Unified Memory on Tesla P100 with CUDA 8.0

Acceleware's Return to Blogs!

Acceleware Professional Services & Training Update

Welcome back to the Acceleware blog!

Acceleware has been extremely busy over the past year working on professional services projects and hosting CUDA/OpenCL training sessions. We added subgridding to our FDTD product and angle gathers to our RTM software. In RF Heating, we have been actively developing new antenna designs and conducting field tests.

Tesla Meets Maxwell

Tesla M40 and Tesla M60 - A New Epoch for GPU Computing

NVIDIA Tesla M40 GPU

"One scientific epoch ended and another began with James Clerk Maxwell." – Albert Einstein

SEG 2015 – New Orleans, October 18-23

Exhibiting at SEG 2015 in New Orleans

Acceleware Booth at SEG 2015

This past month Acceleware attended the Society of Exploration Geophysicists’ 85th Annual Meeting and International Exposition in New Orleans, Louisiana. This show brought in over 8,000 professionals from 70 different countries, and Acceleware presented multiple in-booth talks about our advancements in seismic imaging software and high performance computing for the oil and gas industry.

EAGE 2015 - Madrid, Spain

Exhibiting at EAGE 2015 in Madrid

Acceleware Booth at EAGE 2015

The annual European Association of Geoscientists & Engineers conference and exhibition took place in beautiful Madrid, Spain from June 1-4. Now in its 77th year, the show drew over 6,500 delegates, and Acceleware was there to take it all in.

This year, Acceleware conducted dynamic in-booth talks on our latest developments, focusing on three of our products: AxFWI, AxWave, and AxRTM.

AxFWI - A Revolutionary Modular FWI Platform

AxFWI is a revolutionary modular FWI platform that enables users to accelerate their research by integrating their own algorithms and code into a highly optimized RTM engine. The easy-to-use interface gives users the control and flexibility required to run many different scenarios while still benefiting from a platform engineered for maximum performance.

FWI Formula
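
The "FWI Formula" above was an image in the original post and is not recoverable here; as a stand-in, and only as an assumption about what the figure showed, a conventional least-squares FWI misfit is:

\[
  E(m) = \frac{1}{2} \sum_{s,r} \left\| d_{\mathrm{cal}}(m; s, r) - d_{\mathrm{obs}}(s, r) \right\|^2
\]

where m is the subsurface model and d_cal and d_obs are the simulated and observed data for source s and receiver r; FWI iteratively updates m to reduce E(m).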

Opt-In L1 Caching of Global Loads on Some Kepler/Maxwell GPUs

Background

CUDA developers generally strive for coalesced global memory accesses and/or explicit ‘caching’ of global data in shared memory. However, sometimes algorithms have memory access patterns that cannot be coalesced and that are not a good fit for shared memory. Fermi GPUs have an automatic L1 cache on each streaming multiprocessor (SM) that can be beneficial for these problematic global memory access patterns. First-generation Kepler GPUs have an automatic L1 cache on each SM, but it only caches local memory accesses. In these GPUs, the lack of an automatic L1 cache for global memory is partially offset by the introduction of a separate 48 KB read-only (née texture) cache per SM.
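
For example, here is a minimal sketch (our addition, not from the original post; kernel name hypothetical) of steering an uncoalesced gather through that read-only cache on Compute Capability 3.5+, using the __ldg() intrinsic together with const __restrict__ qualified pointers:

__global__ void gatherViaReadOnlyCache(const float* __restrict__ in,
                                       const int* __restrict__ idx,
                                       float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[idx[i]]);  // irregular read served by the read-only cache
}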

Opt-In L1 Caching on Kepler GPUs

NVIDIA quietly re-enabled L1 caching of global memory on GPUs based on the GK110B, GK20A, and GK210 chips. The Tesla K40 (GK110B), Tesla K80 (GK210), and Tegra K1 (GK20A) all support this feature. You can programmatically query whether a GPU supports caching global memory operations by calling cudaGetDeviceProperties and examining the globalL1CacheSupported property. Examining the Compute Capability alone is not sufficient: the Tesla K20/K20X and Tesla K40 both support Compute Capability 3.5, but only the K40 supports caching global memory in L1.
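
A minimal sketch of that query (checking device 0, error handling omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("%s: L1 caching of global loads is %s\n", prop.name,
           prop.globalL1CacheSupported ? "supported" : "not supported");
    return 0;
}

On supporting chips, the opt-in itself is typically made at compile time by passing -Xptxas -dlcm=ca to nvcc, which tells the assembler to cache global loads in L1.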

Webinar: Essential CUDA Optimization Techniques

Join Chris Mason, Product Manager at Acceleware, and learn how to optimize your algorithms for NVIDIA GPUs. This informative webinar provides an overview of the improved performance analysis tools available in CUDA 6.0 and key optimization strategies for compute-, latency-, and memory-bound problems. The webinar includes techniques for ensuring peak utilization of CUDA cores by choosing the optimal block size. For compute-bound algorithms, Chris discusses how to improve branching efficiency and how to use intrinsic functions and loop unrolling. For memory-bound algorithms, optimal access patterns for global and shared memory are presented, including a comparison between the Fermi and Kepler architectures.
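
As a taste of the compute-bound techniques listed above, here is a minimal sketch (our example, not taken from the webinar) combining loop unrolling with a fast-math intrinsic:

__global__ void sumOfSines(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        #pragma unroll                 // fixed trip count, so the compiler can fully unroll
        for (int k = 1; k <= 8; ++k)
            acc += __sinf(k * in[i]);  // fast intrinsic; less accurate than sinf()
        out[i] = acc;
    }
}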
