Unified Memory on Tesla P100 with CUDA 8.0

Unified Memory, first introduced with Kepler GPUs and CUDA 6.0, simplifies CUDA programming by creating a pool of managed memory shared by both CPUs and GPUs.  Memory is allocated once, yielding a single pointer usable on both the CPU and GPU, and the system automatically migrates data between physical CPU memory and physical GPU memory.  Without Unified Memory, CUDA developers managed shared data explicitly: this typically involved allocating CPU memory, allocating GPU memory, and then explicitly copying data between the two allocations with cudaMemcpy() calls.  The burden is particularly evident when implementing complicated pointer-based C/C++ data structures (e.g., linked lists), which require non-trivial deep-copy operations to transfer between the CPU and GPU.  While the initial implementations of Unified Memory did simplify programming, they did so with considerable performance overhead compared to explicit memory management, as Mark Harris notes:

An important point is that a carefully tuned CUDA program that uses streams and cudaMemcpyAsync to efficiently overlap execution with data transfers may very well perform better than a CUDA program that only uses Unified Memory. Understandably so: the CUDA runtime never has as much information as the programmer does about where data is needed and when! CUDA programmers still have access to explicit device memory allocation and asynchronous memory copies to optimize data management and CPU-GPU concurrency. Unified Memory is first and foremost a productivity feature that provides a smoother on-ramp to parallel computing, without taking away any of CUDA’s features for power users.

Through a simple example, I’ll explore Unified Memory on the latest Tesla P100 (Pascal) GPUs, with and without the new CUDA 8.0 API calls, to see how its performance compares to explicit memory management.

 

Unified Memory with Kepler/Maxwell GPUs

First, we revisit Unified Memory on Kepler/Maxwell GPUs by comparing explicit memory management and Unified Memory for a simple element-wise vector addition kernel that computes C = A + B.  In this example, A, B, and C are vectors of 2^25 single-precision floating-point elements.

Figure 1 shows the host code for the explicit memory management version.  We see the separate allocation of host arrays (aH, bH, cH), the allocations of device memory (aD, bD, cD), and explicit cudaMemcpy() calls to copy data between the host and the device.  The host function GenerateTestData() initializes aH and bH with randomly generated floats, sets cH to 0.0f, and computes a CPU reference solution refH.  The host function CompareData() compares the sum computed on the GPU to the CPU reference  solution.


Figure 1 - Vector Addition with Explicit Memory Management
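As a rough sketch of the pattern Figure 1 illustrates (the names aH/bH/cH, aD/bD/cD, SumArrays(), GenerateTestData(), and CompareData() follow the text above, but the launch configuration and the helper bodies are my own illustrative assumptions, not the original listing):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>

#define N (1 << 25)          // 2^25 single-precision elements

__global__ void SumArrays(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

static float *refH;          // CPU reference solution

// Illustrative stand-in: fill inputs with random floats, zero the output,
// and compute the CPU reference solution
static void GenerateTestData(float *a, float *b, float *c, int n)
{
    refH = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) {
        a[i] = rand() / (float)RAND_MAX;
        b[i] = rand() / (float)RAND_MAX;
        c[i] = 0.0f;
        refH[i] = a[i] + b[i];
    }
}

// Illustrative stand-in: compare the GPU sum to the CPU reference
static void CompareData(const float *c, int n)
{
    for (int i = 0; i < n; i++)
        if (fabsf(c[i] - refH[i]) > 1e-5f) { printf("MISMATCH at %d\n", i); return; }
    printf("OK\n");
}

int main()
{
    size_t bytes = N * sizeof(float);

    // Separate host and device allocations
    float *aH = (float *)malloc(bytes), *bH = (float *)malloc(bytes), *cH = (float *)malloc(bytes);
    float *aD, *bD, *cD;
    cudaMalloc(&aD, bytes); cudaMalloc(&bD, bytes); cudaMalloc(&cD, bytes);

    GenerateTestData(aH, bH, cH, N);

    // Explicit H->D copies of the two input vectors only
    cudaMemcpy(aD, aH, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(bD, bH, bytes, cudaMemcpyHostToDevice);

    SumArrays<<<(N + 255) / 256, 256>>>(aD, bD, cD, N);

    // Explicit D->H copy of the result; cudaMemcpy synchronizes with the kernel
    cudaMemcpy(cH, cD, bytes, cudaMemcpyDeviceToHost);

    CompareData(cH, N);

    cudaFree(aD); cudaFree(bD); cudaFree(cD);
    free(aH); free(bH); free(cH); free(refH);
    return 0;
}
```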

 

Figure 2 shows the host code using Unified Memory.  We see a single allocation each of a, b, and c using cudaMallocManaged(), with the same pointers used on both the host and the device.  Also note the addition of a cudaDeviceSynchronize() call after the SumArrays<<<>>>() kernel launch.  Since the kernel executes asynchronously relative to the host, we need explicit synchronization to guarantee the kernel has finished before we call CompareData().  In fact, on Kepler/Maxwell devices, concurrent access to managed memory from the CPU and GPU is prohibited, and this is enforced by triggering a segmentation fault if the CPU attempts to access managed memory while the GPU is busy.

Figure 2 - Vector Addition with Unified Memory
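A minimal sketch of the Unified Memory pattern in Figure 2 (the launch configuration and the inline initialization/verification loops are illustrative, standing in for the GenerateTestData() and CompareData() helpers described above):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>

#define N (1 << 25)          // 2^25 single-precision elements

__global__ void SumArrays(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    size_t bytes = N * sizeof(float);

    // One managed allocation per vector, usable from both CPU and GPU
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    // Initialize directly through the managed pointers
    for (int i = 0; i < N; i++) {
        a[i] = rand() / (float)RAND_MAX;
        b[i] = rand() / (float)RAND_MAX;
        c[i] = 0.0f;
    }

    SumArrays<<<(N + 255) / 256, 256>>>(a, b, c, N);

    // The kernel is asynchronous: synchronize before the CPU reads c.
    // On Kepler/Maxwell, touching managed memory while the GPU is busy segfaults.
    cudaDeviceSynchronize();

    // Verify on the CPU
    for (int i = 0; i < N; i++)
        if (fabsf(c[i] - (a[i] + b[i])) > 1e-5f) { printf("MISMATCH\n"); return 1; }
    printf("OK\n");

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```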

 

Figure 3 shows an NVIDIA Visual Profiler timeline for the explicit memory management version running on a Tesla K80 GPU (Kepler) in Linux.  Figure 4 shows the Unified Memory version.  Table 1 shows a performance comparison between the two approaches.  With Unified Memory, the vectors are initially allocated in device memory, but first accessed on the host during the GenerateTestData() call.  The system migrates the arrays from D->H during the GenerateTestData() call, even though they are uninitialized!  This activity appears as CPU page faults and D->H migration events on the timeline, and the additional overhead increases the duration of the GenerateTestData() call in the Unified Memory version.  Before the SumArrays<<<>>> kernel executes, the vectors are migrated from the host back to the device.  These transfers are slightly more efficient than the explicit equivalents, but the Unified Memory version transfers all three vectors, whereas the explicit memory version only transfers the two input vectors.  The overhead of the page faults and implicit D->H migration of the result vector increases the duration of CompareData() in the Unified Memory version.

Figure 3 - NVIDIA Visual Profiler Timeline for Explicit Memory Management on Tesla K80

 

 

Figure 4 - NVIDIA Visual Profiler Timeline for Unified Memory on Tesla K80

 

 

Table 1 - K80 Performance Comparison

Unified Memory with Pascal

One of the significant new features of the Tesla P100 is the Page Migration Engine, which uses a CPU-like page fault mechanism to manage accesses to managed memory.  Additionally, Pascal enables concurrent access to managed data from host and device, and the page fault mechanism supports oversubscription of device memory.  See Mark Harris’s CUDA 8 Features Revealed for more details.
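Whether a device supports this Pascal-style concurrent managed access can be checked at runtime; a small sketch using the CUDA 8.0 device attribute API:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int device = 0, concurrent = 0;
    cudaGetDevice(&device);

    // Nonzero on Pascal-class devices with the Page Migration Engine;
    // zero on Kepler/Maxwell, where concurrent CPU/GPU access is prohibited
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
    printf("concurrentManagedAccess: %d\n", concurrent);
    return 0;
}
```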

Figure 5 shows an NVIDIA Visual Profiler timeline for the explicit memory management version running on a PCIe Tesla P100 GPU in Linux.  Figure 6 shows the Unified Memory version.  Table 2 shows a performance comparison between explicit memory management and Unified Memory.  With Tesla P100 Unified Memory, managed memory is initially allocated on first touch.  In our case, that occurs during GenerateTestData().  We still see page faults on the timeline, but unlike on Kepler, where managed allocations were initially placed on the device, there are no needless D->H migrations.  When the kernel executes, accesses to the vectors trigger GPU page faults, resulting in H->D migrations of pages over PCIe at a theoretical peak of ~16 GB/s.  For a memory-bandwidth-limited kernel like vector addition, this results in a significant slowdown compared to the explicit memory management version, where the vectors are resident in the GPU’s high-performance HBM2 memory and accessed at a theoretical peak of ~732 GB/s.  cudaDeviceSynchronize() is still required between the kernel launch and CompareData() for correctness, and CompareData() behavior is unchanged.

Figure 5 - NVIDIA Visual Profiler Timeline for Explicit Memory Management on Tesla P100

 

Figure 6 - NVIDIA Visual Profiler Timeline for Unified Memory Management on Tesla P100

 

 

Table 2 - Performance Comparison on Tesla P100

 

New CUDA 8.0 Unified Memory API Calls

While Unified Memory does offer simplified programming compared to explicit memory management, the performance comparisons demonstrate that this simplicity comes at considerable expense, even with the new Page Migration Engine on Pascal GPUs.  Fortunately, CUDA 8.0 introduces the cudaMemAdvise() API call to provide the runtime with memory usage hints, and cudaMemPrefetchAsync() to explicitly prefetch data instead of waiting for page faults to trigger migrations.  These new API calls retain all the advantages and programming simplicity of Unified Memory while matching the performance of explicit memory management.
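As an example of the hint API (a fragment, not the original listing: it assumes the managed pointers a, b, c and size bytes from the Unified Memory version above, and the choice of hints is illustrative):

```cuda
// Sketch: advising the runtime about expected access patterns for the
// managed allocations a, b (read-only inputs) and c (output).
int device;
cudaGetDevice(&device);

// The inputs are read but never written by the kernel: read-mostly pages
// can be duplicated rather than migrated on concurrent access.
cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, device);
cudaMemAdvise(b, bytes, cudaMemAdviseSetReadMostly, device);

// Keep the output resident on the GPU, where it is produced.
cudaMemAdvise(c, bytes, cudaMemAdviseSetPreferredLocation, device);
```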

Figure 7 adds cudaMemPrefetchAsync() calls to prefetch the vectors instead of waiting for page faults to trigger page migrations.  These API calls are supported on devices that can coherently access managed memory concurrently with the host, which can be verified programmatically via the concurrentManagedAccess device property.  Figure 8 shows the profiler timeline.  Table 3 shows the performance comparison updated with results using prefetch.  With the addition of the prefetch calls, there are no H->D migrations during kernel execution, nor D->H migrations during CompareData().  The Unified Memory version matches the explicit memory version except for the additional page fault overhead during initial allocation.  For more realistic and complicated algorithms, that would be a one-time expense.

Figure 7 - Vector Addition with Unified Memory and cudaMemPrefetchAsync()
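The prefetch pattern of Figure 7 amounts to something like the following fragment (assuming the managed pointers a, b, c, the size bytes, and the SumArrays() kernel and CompareData() helper from the earlier Unified Memory version; the launch configuration is illustrative):

```cuda
int device;
cudaGetDevice(&device);

// Prefetch all three vectors to the GPU before the kernel runs, so kernel
// accesses hit HBM2 instead of faulting pages across PCIe.
cudaMemPrefetchAsync(a, bytes, device);
cudaMemPrefetchAsync(b, bytes, device);
cudaMemPrefetchAsync(c, bytes, device);

SumArrays<<<(N + 255) / 256, 256>>>(a, b, c, N);

// Prefetch the result back to the host before CompareData() reads it,
// avoiding CPU page faults and D->H migrations during verification.
cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId);
cudaDeviceSynchronize();

CompareData(c, N);
```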

 

Figure 8 - NVIDIA Visual Profiler Timeline of Tesla P100 Unified Memory with Prefetch

 

 

Table 3 - Prefetch Performance on Tesla P100


Conclusions

With Pascal GPUs and new CUDA 8.0 APIs, Unified Memory offers simplified programming AND matches the performance of explicit memory management.

The .NVVP files are available so that you can load the profiler results.

Download NVVP Files