GPU Hardware System Engineering – A Debugging Story

The Performance Problem

Earlier this year, a customer had a most unusual problem. They had bought a workstation with a P100 GPU to run finite difference time domain (FDTD) simulations. Initially, the simulation ran at the expected speed, but over time the GPU would slow down to approximately 1/3 of the expected speed. From a support perspective, we had never seen an issue like this. Was it a software bug? Why would the simulation slow down? Setting aside the possibility of a software bug for the moment, I focused my efforts on the system configuration.

Preliminary Debugging

The customer sent me the OS configuration, NVIDIA driver version, a sample simulation, and software version information. Even though they were running a P100, I asked them to check the NVIDIA driver model to determine whether the GPU was set to TCC or WDDM []. As expected, the GPU was running in TCC mode. I considered the fact that Windows 10 is continuously under development, so perhaps a new update, or a lack of updates, was causing this. After ensuring the OS was up to date, I escalated the issue to our software development team. My best guess was that the customer had found a bug in our code related exclusively to the Pascal architecture. As a final attempt, I asked the customer to upgrade to the latest BIOS. As expected, the BIOS update did nothing to resolve the problem.

The Debugging Plot Thickens

Our software development lead doubted that this was a Pascal-related bug and believed it to be a hardware issue. Acceleware has a limited number of P100 GPUs, so hardware access is limited. Initially, I gave the software developer a K10 running Windows Server 2012 R2 to debug on. After making no progress, the software developer insisted that I get him access to a P100 running Windows 10, mirroring the customer's workstation as closely as possible. Our server is a two-socket Dell R730 with two P100s, while the customer's system has only one P100. Since we didn't want to remove one P100 for testing, we used CUDA_VISIBLE_DEVICES to run on one GPU only []. Finally, we were able to debug on a setup similar to the customer's. During a short simulation run, the speed slowed down over time. We were now able to reproduce the problem, although it was not as pronounced as in the customer's case. Also, the slowdowns were different on the two P100s we isolated with CUDA_VISIBLE_DEVICES.
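The isolation step looks roughly like this. A minimal sketch for a POSIX shell (on a Windows workstation like the customer's, the cmd.exe equivalent is `set CUDA_VISIBLE_DEVICES=0`); the CUDA_DEVICE_ORDER line is our assumption, added so that CUDA's device indices match nvidia-smi's PCI ordering:

```shell
# Make CUDA device indices follow PCI bus order, matching nvidia-smi's listing.
export CUDA_DEVICE_ORDER=PCI_BUS_ID

# Expose only the first P100 to CUDA; use =1 to test the second card alone.
# Any CUDA program launched from this shell sees a single GPU (as device 0).
export CUDA_VISIBLE_DEVICES=0

echo "Visible GPUs: $CUDA_VISIBLE_DEVICES"
```

Note that indices inside a CUDA program are always relative to the visible set, so the selected P100 is enumerated as device 0 regardless of its physical slot.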

Bad Hardware or Simply a Software Bug?

The performance drop made no sense to us; maybe we had a defective P100. We tried the second P100, using CUDA_VISIBLE_DEVICES to select the other GPU. This time the speed dropped off faster. Two defective P100s? That seemed unlikely. However, we observed that our speed drop-off wasn't quite as bad as the customer's; we went from full speed down to 90% of full speed. We have an NVIDIA Titan X Pascal in house that is almost the same as a P100. Testing a similar setup with our Titan X Pascal didn't result in any performance issues. The software developer and I concluded that we must have a P100-related bug in our FDTD software. Debugging it would be a difficult and time-consuming task, as the issue was specific to the card. We relayed the bad news to our customer.

A Breakthrough!

We thought of one last idea before intensively debugging the software: we decided to run nvidia-smi to log the GPU clock speed. P100s and many other NVIDIA GPUs have variable clock speeds to conserve power and increase performance. (While the P100 has several clock domains, for simplicity we graph only the graphics clock.) The following command logs the GPU clock speed, temperature, and power draw every 5 seconds until the command is killed.

"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" --query-gpu=timestamp,name,pci.bus_id,temperature.gpu,power.draw,clocks.gr --format=csv -l 5

Figure 1: nvidia-smi output


Figure 2


Initially, the P100 was running at 405 MHz, the idle speed. Once the CUDA program started, it ran at 1189 MHz, and it stayed at that speed while the simulation was initialized on the GPU. Once the GPU began time-stepping through the simulation, the clock rate increased to 1328 MHz and the GPU temperature began rising. A little later, the temperature reached 79 C and the clock was throttled to keep the GPU from overheating; the cooling was insufficient. As the simulation progressed, the throttling increased, and the clock dropped as low as 898 MHz, almost a 1/3 drop in speed from the boost clock.
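Pulling the minimum and maximum clock out of the log makes the throttling easy to quantify. A sketch, using a hypothetical few lines of the CSV that the logging command above produces (the values mimic the numbers from our run, not real log output):

```shell
# Hypothetical sample of the nvidia-smi CSV log
# (timestamp, name, pci.bus_id, temperature, power, graphics clock).
cat > gpu_log.csv <<'EOF'
timestamp, name, pci.bus_id, temperature.gpu [C], power.draw [W], clocks.current.graphics [MHz]
2017/05/01 10:00:00.000, Tesla P100-PCIE-16GB, 0000:04:00.0, 35, 28.5, 405 MHz
2017/05/01 10:05:00.000, Tesla P100-PCIE-16GB, 0000:04:00.0, 62, 180.1, 1328 MHz
2017/05/01 10:30:00.000, Tesla P100-PCIE-16GB, 0000:04:00.0, 79, 165.4, 898 MHz
EOF

# Column 6 is the graphics clock ("405 MHz"); "+ 0" coerces it to a number.
# Skip the header row, then track the minimum and maximum observed.
awk -F', ' 'NR > 1 { mhz = $6 + 0
                     if (min == "" || mhz < min) min = mhz
                     if (mhz > max) max = mhz }
            END { printf "min %d MHz, max %d MHz\n", min, max }' gpu_log.csv
# -> min 405 MHz, max 1328 MHz
```

A large gap between the two, with the minimum occurring late in the run, is the signature of progressive thermal throttling rather than a software slowdown.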

Why did it drop? I used the Dell iDRAC interface to increase the server's fan speed to 100%. Running the simulation again, the GPU clock rate was consistent, as was the simulation speed. At that point it became obvious: the P100 was throttling its clock rate under load because the cooling was insufficient. This also explained why the two P100 cards behaved differently; slightly different airflow caused different throttling rates. Figure 1 shows that one P100 always runs cooler than the other. I asked the customer to change the fan speed on their workstation, and that solved the problem.
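In hindsight, nvidia-smi can report the throttle reason directly rather than leaving us to infer it from clocks and temperatures. A hardware-dependent sketch (the clocks_throttle_reasons fields assume a reasonably recent driver; run `nvidia-smi --help-query-gpu` to see which ones your version supports):

```shell
"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" --query-gpu=clocks.gr,temperature.gpu,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.sw_power_cap --format=csv -l 5
```

Each throttle-reason column reads "Active" or "Not Active", so a thermally limited run shows the slowdown flag flipping to "Active" at the same time the graphics clock falls.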

Wrapping Up

Why did this happen? Both our Dell server and the customer's workstation were on NVIDIA's list of certified systems for P100 GPUs []. The latest BIOS was installed on both computers. However, the customer was running their workstation in an office environment, not in a dedicated server room. As a result, their GPU clock throttling was even more significant than in our tests: their clock ran as low as 645 MHz, roughly half the peak clock. The underlying reason the issue occurred is that NVIDIA Tesla GPUs are intended for large datacenters, which have a fixed cooling and power budget, often managed by a dedicated team. For those of us running just one or two P100s, we needed to better understand the environment required for a P100 to run at its maximum potential. In our case, the server fan speed needed adjustment, while the customer was running a P100 in a less-than-ideal environment. As for our server room, we later found out that one of our two AC units needed servicing. Adjusting the fan speeds solved the problem, but the root cause of the slowdown was environmental.