Acceleware's Return to Blogging!

Acceleware Professional Services & Training Update

Welcome back to the Acceleware blog!

Acceleware has been extremely busy over the past year working on professional services projects and hosting CUDA/OpenCL training sessions. We added subgridding to our FDTD product and angle gathers to our RTM software.  In RF Heating, we have been actively developing new antenna designs and conducting field tests. 

We attended some of our favorite shows including EAGE, SEG, and GTC in 2016 and plan to attend them again this year.  We are very excited to continue engaging with you on professional services and training in 2017. 

In an effort to revive the blog, we plan to share a technical tip and/or some of our experiences with high performance computing each month. To get started, I wanted to follow up on a couple of questions we received during our webinar last year. You can find a recording of the webinar on NVIDIA’s website: http://on-demand.gputechconf.com/gtc/2016/webinar/catch-up-on-cuda.mp4

Can I Run Multiple Kernels at the Same Time on an NVIDIA GPU?

Yes. However, in most cases kernels execute serially. Each streaming multiprocessor (SM) is assigned blocks of threads to process. If two kernels are launched in series, all of the blocks of the first kernel run to completion before the second kernel starts. The following diagram shows how kernels are executed serially:

Kernels Executed Serially

Asynchronous memory transfers and concurrent kernel execution are enabled through the use of streams.  A stream is a sequence of operations that execute in order.  However, there are no ordering constraints between different streams!  This allows for concurrent kernel execution. You can read more about streams in the CUDA C Programming Guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/#streams).
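As a minimal sketch of how this looks in host code, the kernel scaleData and all of the buffer names and sizes below are made up for illustration. The copy in, the kernel launch, and the copy out are all issued into the same stream, so they execute in order relative to each other:

#include <cuda_runtime.h>

// A trivial kernel used only for illustration.
__global__ void scaleData(float* data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    float* h_data;
    float* d_data;

    // Pinned (page-locked) host memory is needed for the copies to be truly asynchronous.
    cudaMallocHost(&h_data, N * sizeof(float));
    cudaMalloc(&d_data, N * sizeof(float));
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Operations issued into the same stream execute in order:
    // copy in, then the kernel, then copy out.
    cudaMemcpyAsync(d_data, h_data, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    scaleData<<<(N + 255) / 256, 256, 0, stream>>>(d_data, N);
    cudaMemcpyAsync(h_data, d_data, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // Wait for all work in this stream.

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}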

In the diagram below, none of the kernels except Kernel2 have enough blocks to occupy all of the SMs for a given unit of time.  In this scenario, concurrent kernel execution provides the opportunity to fully occupy all of the streaming multiprocessors.

Serial Kernels vs Concurrent Kernels

If Kernel2 is launched after Kernel1 in the same stream in the host code, there is an assumed dependency and the kernels are executed serially.  However, if there is no data dependency between the kernels, concurrent kernel execution may be desirable.  To use this feature, the kernels must be launched from the same process in different non-default streams.  The GPU scheduler is then able to execute the kernels concurrently if the resources are available to do so.
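A minimal sketch, assuming two independent kernels kernelA and kernelB operating on separate device buffers (all of the names are hypothetical):

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Launched in different non-default streams with no data dependency,
// so the scheduler may run them concurrently if SM resources allow.
kernelA<<<blocksA, threadsA, 0, stream1>>>(d_bufferA);
kernelB<<<blocksB, threadsB, 0, stream2>>>(d_bufferB);

cudaDeviceSynchronize();  // Wait for both streams to finish.

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);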

In practice, this feature works best if each kernel launch consists of only a few blocks.  Large kernels tend to use most of the GPU's resources, so kernel concurrency yields limited benefit.

Can Threads Share Memory?

Yes.  Threads in the same block can use shared memory to exchange data.  Shared memory is a small block of memory located on the streaming multiprocessor, ranging in size from 16 KB to 112 KB depending on the compute capability and GPU configuration.  Each block can access a maximum of 48 KB of shared memory.
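If you want to know the exact limits for the GPU you are running on, they can be queried at run time.  A minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // Query device 0.

    // Shared memory available to a single block and to a whole SM, in bytes.
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}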

Shared memory is particularly useful for avoiding redundant computations and global memory accesses, since accessing shared memory is substantially faster than accessing global memory.

As an example, if each thread had to average three neighboring values, the steps would be as follows:

  1. Have each thread load a value and store it into shared memory
  2. Synchronize
  3. Add the values and divide
  4. Store the result in global memory

The corresponding GPU kernel code would look something like this:
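(A minimal sketch; apart from local_idx, the names below, including averageThree and BLOCK_SIZE, are illustrative and assume a one-dimensional block of BLOCK_SIZE threads.)

#define BLOCK_SIZE 256  // Assumed to match the launch configuration.

__global__ void averageThree(const float* in, float* out)
{
    __shared__ float values[BLOCK_SIZE];

    int idx       = blockIdx.x * blockDim.x + threadIdx.x;
    int local_idx = threadIdx.x;

    // Step 1: each thread loads one value into shared memory.
    values[local_idx] = in[idx];

    // Step 2: wait until every thread in the block has written its value.
    __syncthreads();

    // Step 3: add the neighboring values and divide.  Note that the halo
    // is not handled here; see the discussion below.
    float average = (values[local_idx - 1] + values[local_idx] +
                     values[local_idx + 1]) / 3.0f;

    // Step 4: store the result back to global memory.
    out[idx] = average;
}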


Before a thread can read values written to shared memory by other threads, the function __syncthreads must be called.  Otherwise, the values read in the averaging step (step 3) might contain invalid data.  You can read about the details of __syncthreads in the CUDA C Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions).

Finally, this algorithm would require additional logic to handle the ‘halo’ surrounding the region of interest.  For example, if local_idx == 0, the averaging step would access the element at index [-1].
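One possible way to handle this, under the same assumptions as the sketch above, is to allocate two extra shared-memory slots and have the edge threads of each block also load the halo elements:

__shared__ float values[BLOCK_SIZE + 2];

// Every thread loads its own element, shifted right by one slot.
values[local_idx + 1] = in[idx];

// The edge threads of the block additionally load the halo elements.
if (local_idx == 0)
    values[0] = in[idx - 1];
if (local_idx == blockDim.x - 1)
    values[local_idx + 2] = in[idx + 1];

__syncthreads();

// Threads at the edges of the whole domain (idx == 0 or idx == N - 1)
// still need a separate boundary check before the reads above.
out[idx] = (values[local_idx] + values[local_idx + 1] +
            values[local_idx + 2]) / 3.0f;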

Data exchange between different blocks would require threads to be scheduled concurrently, which is not guaranteed across thread blocks. However, you can use atomics to have different threads modify the same memory location without race conditions.  Atomics can be used on both global and shared memory.  Note, however, that atomics result in serialization if threads contend for the same memory location, which may slow down the kernel.  Details about atomic operations can be found here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomic-functions
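As a small sketch (again with illustrative names), a block-wide sum that uses an atomic on shared memory for the partial result and an atomic on global memory for the final total might look like this:

__global__ void sumValues(const float* in, float* total, int n)
{
    __shared__ float blockSum;

    if (threadIdx.x == 0) blockSum = 0.0f;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&blockSum, in[idx]);   // Atomic add on shared memory.
    __syncthreads();

    // One thread per block folds the partial result into the global total.
    if (threadIdx.x == 0)
        atomicAdd(total, blockSum);      // Atomic add on global memory.
}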

Check back for more programming tips next month, and in the meantime please sign up to receive updates so you don’t miss out.  Thanks for reading!