Real-Time NUMA Node Performance Analysis Using Intel Performance Counter Monitor

A typical method of analyzing code performance is to use profiling tools such as gprof or Intel VTune Amplifier.  However, for real-time analysis of hardware (such as CPU and memory) utilization during program execution, profiling tools offer limited or non-existent capabilities, and the real-time performance counters built into Windows (Resource Monitor) or Linux (the top command) do not provide much detail either, especially when we start using NUMA nodes. Thankfully, Intel offers a great set of utilities for this purpose, called the Intel Performance Counter Monitor!  In this blog post, we use several Intel Performance Counter Monitor (PCM) utilities to diagnose less-than-optimal performance in two examples:

  1. Sub-optimal NUMA node usage
  2. CPU caching issues

For those who are new to NUMA, NUMA stands for Non-Uniform Memory Access.  NUMA is an architecture used in systems with multiple CPUs, such as multi-socket x64 servers, where each CPU has access to local memory and remote memory (generally, remote memory for one CPU is local memory for another CPU).  This results in different memory access times for each CPU, depending on which kind of memory it is accessing.  Remote memory accesses have higher latency and possibly lower bandwidth compared to local memory accesses, because they require socket-to-socket communication. For more detailed information about NUMA architectures, take a look at https://en.wikipedia.org/wiki/Non-uniform_memory_access.
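
To make the local/remote distinction concrete, here is a minimal sketch (for illustration only; it is not part of this post's examples) that uses libnuma on Linux to query the number of NUMA nodes and to allocate a buffer that physically resides on a specific node:

    #include <numa.h>    // libnuma; link with -lnuma
    #include <iostream>

    int main()
    {
        if (numa_available() < 0)
        {
            std::cout << "NUMA is not available on this system." << std::endl;
            return 1;
        }

        // Nodes are numbered 0 through numa_max_node().
        std::cout << "Number of NUMA nodes: " << numa_max_node() + 1 << std::endl;

        // Allocate 1GB of memory physically located on node 0.
        size_t memSize = 1ull << 30;
        float* buffer = (float*)numa_alloc_onnode(memSize, 0);

        // A thread running on a node 0 core sees local accesses here;
        // a thread running on another node pays the remote-access penalty.
        buffer[0] = 1.0f;

        numa_free(buffer, memSize);
        return 0;
    }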

In our examples, we execute on our NUMA test machine, the physical configuration of which is illustrated in Figure 1 using the lstopo tool, provided with the Portable Hardware Locality (hwloc) software package.  For those of you unfamiliar with hwloc, it is a set of tools that provide an abstraction of the hierarchical topology of modern architectures (such as NUMA).  It is portable across operating systems, so it can run on Windows and Linux.  You can learn more about hwloc and download it at https://www.open-mpi.org/projects/hwloc.

Our test machine has two CPU sockets, each occupied by an Intel Xeon E5-2650 v3 CPU. Each Xeon CPU has 10 physical cores (labelled Core P#X in lstopo) and, with Hyper-Threading, can run up to 20 threads in total (labelled PU P#X in lstopo).  In this case, each socket is its own NUMA node.  Each socket/NUMA node has 128GB of local memory. All of the processing units in NUMA node 0 (even-numbered PUs) can access the 128GB of memory located in NUMA node 1 and vice-versa, but that would be a remote access with all the associated penalties.

 


Figure 1. lstopo output of test machine.

NUMA node usage analysis using pcm-numa and pcm-memory

In this example, to illustrate some of the metrics we see in pcm-numa and pcm-memory, we have a program that executes on all cores of NUMA node 0.  It allocates memory on node 0 and node 1, and then initializes the memory on node 1.  It then reads the memory from node 1, performs basic arithmetic, and stores the result in the memory allocated on node 0.  This example was written for Linux and uses libnuma; however, if you wish to run something similar on Windows, the Windows API offers similar functionality (a rough sketch appears after the listing below).

 

    // Note: this listing is an excerpt from main(); it requires <iostream>, <omp.h>, and
    // <numa.h>, and must be compiled with OpenMP enabled and linked against -lnuma.
    int numThreads = 1;
    #pragma omp parallel
    {
        numThreads = omp_get_num_threads();
    }

    size_t blockSize = 7240 * 7240;
    size_t numElements = blockSize * numThreads;
    size_t memSize = numElements * sizeof(float);
    float* node0Mem = (float*)numa_alloc_onnode(memSize, 0);
    float* node1Mem = (float*)numa_alloc_onnode(memSize, 1);

    // Phase 1: Initializing memory on NUMA node 1.
    std::cout << "Initializing memory on non-local NUMA node." << std::endl;
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (int steps = 0; steps < 100; steps++)
        {
            for (size_t i = tid * blockSize; i < (tid + 1) * blockSize; i++)
            {
                node1Mem[i] = i + steps;
            }
        }
    }

    // Phase 2: Reading memory from node 1, performing arithmetic, and storing the result in node 0.
    std::cout << "Working on local memory and non-local memory." << std::endl;
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (int steps = 0; steps < 100; steps++)
        {
            for (size_t i = tid * blockSize; i < (tid + 1) * blockSize; i++)
            {
                node0Mem[i] = node1Mem[i] + steps;
            }
        }
    }
    numa_free(node0Mem, memSize);
    numa_free(node1Mem, memSize);
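
For comparison, on Windows a rough equivalent of the node-specific allocation above could be sketched with the VirtualAllocExNuma API (this is an illustrative sketch, not code used in this post):

    #include <windows.h>
    #include <iostream>

    int main()
    {
        SIZE_T memSize = 1ull << 30;  // 1GB

        // Reserve and commit memory, preferring NUMA node 0 (last parameter)
        // for the physical pages.
        float* node0Mem = (float*)VirtualAllocExNuma(GetCurrentProcess(), NULL, memSize,
                                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, 0);
        if (node0Mem == NULL)
        {
            std::cout << "Allocation failed." << std::endl;
            return 1;
        }

        node0Mem[0] = 1.0f;  // physical pages are assigned on first touch

        VirtualFree(node0Mem, 0, MEM_RELEASE);
        return 0;
    }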

 

pcm-numa

The pcm-numa tool displays the instructions-per-cycle (IPC), number of instructions, number of CPU cycles, local DRAM accesses, and remote DRAM accesses for each core. 

During Phase 1 of the code, where we initialize memory on NUMA node 1 from node 0, pcm-numa (Figure 2) shows us an increased number of remote DRAM accesses on the corresponding cores (the even-numbered cores, 0 through 38).  It also displays a relatively low IPC count (~0.5), due to the inefficient memory access.

 
 
Figure 2. pcm-numa output during phase 1
 
 

pcm-memory

The pcm-memory tool displays detailed performance metrics for the physical memory installed in each NUMA node, such as read/write speeds and total system memory throughput.  This tool enables us to determine whether we are maximizing our system memory performance.

 

Figure 3. pcm-memory output during phase 2

In Figure 3 we can see the output of pcm-memory during phase 2.  The tool shows us that in NUMA node 1 (labelled Socket 1) we have a series of reads from memory, which we then use to write the results in node 0’s local memory.

 

We can correct the program to initialize all memory on node 0 instead of node 1, such that all memory accesses are local:
float* node1Mem = (float*)numa_alloc_onnode(memSize, 0);

 

Re-compiling and re-running the program, we can see the results of our labour in pcm-numa, via an increased IPC count on the even-numbered cores (Figure 4).  Note the local memory accesses shown as well.  The odd-numbered cores remain unused by the program, as they belong to node 1.

Figure 4. pcm-numa output when all arrays are allocated on node 0

CPU cache usage monitoring using pcm

In our second example, we create a representation of a 2D matrix using a 1D array allocation.  We initialize the matrix, and then perform two operations:

  • Compute the square of each matrix element in a given row, and store the sum of the squares for that row in an array.
  • Compute the square of each matrix element in a given column, and store the sum of the squares for that column in an array.
    // Excerpt from main(); requires <iostream> and <omp.h>, and OpenMP must be enabled.
    size_t matrixDim = 50000;
    float* matrix1D = new float[matrixDim * matrixDim];   // 50000 x 50000 floats, roughly 10GB
    float* answer = new float[matrixDim]();               // value-initialized to zero

    // initialize matrix
    for (size_t i = 0; i < matrixDim * matrixDim; i++)
    {
        matrix1D[i] = (float)i;
    }

    std::cout << "Performing efficient memory access." << std::endl;
    double startTime = omp_get_wtime();
    // row-major memory access
    for (int step = 0; step < 200; step++)
    {
        for (size_t i = 0; i < matrixDim; i++)
        {
            size_t rowOffset = i * matrixDim;
            for (size_t j = 0; j < matrixDim; j++)
            {
                answer[i] += matrix1D[rowOffset + j] * matrix1D[rowOffset + j];
            }
        }
    }
    double endTime = omp_get_wtime();
    std::cout << "Completed. Time elapsed: " << endTime - startTime << " seconds." << std::endl;

    std::cout << "Performing inefficent memory access." << std::endl;
    startTime = omp_get_wtime();
    // column-major memory access
    for (int step = 0; step < 20; step++)
    {
        for (size_t i = 0; i < matrixDim; i++)
        {
            for (size_t j = 0; j < matrixDim; j++)
            {
                // Element (j, i): successive values of j are matrixDim floats apart in memory,
                // so every access lands on a different cache line.
                size_t colIndex = j * matrixDim + i;
                answer[i] += matrix1D[colIndex] * matrix1D[colIndex];
            }
        }
    }
    endTime = omp_get_wtime();
    std::cout << "Completed. Time elapsed: " << endTime - startTime << " seconds." << std::endl;

    delete[] matrix1D;
    delete[] answer;
 

As a quick refresher, in C++, arrays are stored in row-major order – that is, to proceed sequentially through memory in a 2D representation of an array, we access consecutive elements of a given row before proceeding to the next row.  (See https://en.wikipedia.org/wiki/Row-_and_column-major_order for more detailed information.)

This has performance implications as well.  When a program reads an array value from memory (one that is not already stored in cache), it also loads a number of subsequent array elements from memory into the CPU cache, via a cache line. This allows for greater performance, as long as we read successive array elements in row-major order.  If we read array elements in column-major order, we lose the performance benefit of the cache, since the wrong array elements are brought into the cache and nearly every access ends up fetching data from memory instead.
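
As a rough back-of-the-envelope illustration (assuming 4-byte floats and 64-byte cache lines, as on our test CPU), the following sketch counts how many cache lines each access pattern touches per row or column of our 50000-element-wide matrix:

    #include <cstddef>
    #include <iostream>

    int main()
    {
        const size_t cacheLineBytes = 64;                             // typical x86 cache line
        const size_t floatsPerLine  = cacheLineBytes / sizeof(float); // 16 floats per line
        const size_t matrixDim      = 50000;

        // Row-major traversal of one row: consecutive elements share cache lines,
        // so one miss brings in the next elements of that row for free.
        size_t rowMajorLines = matrixDim / floatsPerLine;             // 3125 lines per row

        // Column-major traversal of one column: successive elements are matrixDim floats
        // (roughly 200KB) apart, so every single access touches a new cache line.
        size_t colMajorLines = matrixDim;                             // 50000 lines per column

        std::cout << "Cache lines touched per row (row-major):       " << rowMajorLines << std::endl;
        std::cout << "Cache lines touched per column (column-major): " << colMajorLines << std::endl;
        return 0;
    }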

Note: As a reminder, this example was tailored towards execution on an Intel Xeon E5-2650 v3, with 256KB of L2 cache per core and 25MB of L3 cache shared amongst all cores on a socket.  Your results may vary depending on your CPU.

pcm-x

The pcm-x tool displays quite a bit of information.  Similarly to the pcm-numa tool, it shows information on a per-core basis.  For this example, we are interested in the cache-related metrics such as L3MISS and L2MISS (cache misses to L3 and L2 cache, respectively), and L3HIT and L2HIT (L3/L2 cache hit ratios). For the purposes of this example, it is also worth noting that the FREQ metric indicates the ratio of the current core clock frequency to the nominal clock frequency.

Figure 5. Output of pcm-x during efficient cache usage

Looking at the output of pcm-x during the first operation in Figure 5 above, we see a 100% L2 cache hit rate on core 38, which we can identify as the core executing this program because its entry in pcm-x is the only one with a FREQ value greater than zero, as well as a high IPC count (all other cores were generally idle during execution).

In the second operation, shown in Figure 6 below, we see a very different picture.  With 256KB of L2 cache per core and 64-byte cache lines, iterating through 50000 array elements in the wrong order quickly results in the L2 cache never having the necessary data.  This is shown in pcm-x by a 0% L2HIT number, as well as the comparatively large L2MISS number.

 

Figure 6. pcm-x output during inefficient cache usage

While these examples have focused on CPU, memory, and cache usage in NUMA architectures, there are definite applications to GPUs using CUDA as well!  On a NUMA node with GPUs running CUDA, for maximum data transfer performance between device and host you will typically want to allocate pinned memory on the NUMA node closest to that GPU (a sketch of one way to do this follows Figure 7 below).  Using the nvidia-smi topo -m command will show you the affinity of each of your GPUs relative to each other and to CPU cores.

 

Figure 7. nvidia-smi topo -m output.
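
As a minimal sketch of that idea (illustrative only, and assuming GPU 0 is attached to NUMA node 0; the calls below are standard libnuma and CUDA runtime APIs, not code from this post's examples):

    #include <numa.h>          // libnuma; link with -lnuma
    #include <cuda_runtime.h>  // CUDA runtime; link with -lcudart
    #include <cstring>

    int main()
    {
        const int gpuId = 0;          // assumption: GPU 0 is attached to NUMA node 0
        const int numaNode = 0;
        size_t memSize = 1ull << 30;  // 1GB staging buffer

        cudaSetDevice(gpuId);

        // Allocate host memory physically on the NUMA node closest to the GPU,
        // then pin (page-lock) it so transfers can use fast DMA.
        float* hostBuf = (float*)numa_alloc_onnode(memSize, numaNode);
        std::memset(hostBuf, 0, memSize);   // first touch commits the pages on that node
        cudaHostRegister(hostBuf, memSize, cudaHostRegisterDefault);

        float* devBuf = nullptr;
        cudaMalloc(&devBuf, memSize);
        cudaMemcpy(devBuf, hostBuf, memSize, cudaMemcpyHostToDevice);

        cudaFree(devBuf);
        cudaHostUnregister(hostBuf);
        numa_free(hostBuf, memSize);
        return 0;
    }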

Using the nvidia-smi tool in conjunction with the hardware locality library and the Intel Performance Counter Monitor will go a long way toward ensuring that your GPUs are communicating optimally with the right CPUs.  Hopefully these simple examples illustrate the potential usefulness of some of the Intel Performance Counter Monitor tools!  For more detail on multiple GPUs in a NUMA node, check out a presentation by NVIDIA at http://on-demand.gputechconf.com/gtc/2012/presentations/S0515-GTC2012-Multi-GPU-Programming.pdf.

 

The code used in these examples can be downloaded below.