Tesla Meets Maxwell

Tesla M40 and Tesla M60 - A New Epoch for GPU Computing

[Image: NVIDIA Tesla M40 GPU]

"One scientific epoch ended and another began with James Clerk Maxwell." – Albert Einstein

NVIDIA has recently launched new Tesla GPUs based on the Maxwell architecture.
Table 1 lists key specifications for the new Maxwell-based Tesla M40 and M60 compared to the Kepler-based Tesla K10, K40 and K80.

Table 1 - Key Specifications of NVIDIA Tesla GPUs

|  | Tesla K10 | Tesla K40 | Tesla M40 | Tesla M60 | Tesla K80 |
|---|---|---|---|---|---|
| Architecture | Kepler | Kepler | Maxwell | Maxwell | Kepler |
| Cores | 3072 (2x1536) | 2880 | 3072 | 4096 (2x2048) | 4992 (2x2496) |
| Memory (GB) | 8 (2x4) | 12 | 12 | 16 (2x8) | 24 (2x12) |
| Memory Bandwidth (GB/s) | 320 (2x160) | 288 | 288 | 320 (2x160) | 480 (2x240) |
| Peak Single Precision (TFLOPS) | 4.58 | 5.04 | 6.84 | 9.7 | 8.74 |
| Peak Double Precision (TFLOPS) | 0.19 | 1.68 | 0.213 | 0.3 | 2.91 |
| Double:Single Precision Throughput Ratio | 1:24 | 1:3 | 1:32 | 1:32 | 1:3 |
| Maximum Shared Memory per Streaming Multiprocessor (KB) | 48 | 48 | 96 | 96 | 112 |
| Registers per Streaming Multiprocessor (KB) | 256 | 256 | 256 | 256 | 512 |
| Compute Capability | 3.0 | 3.5 | 5.2 | 5.2 | 3.7 |
| Cooling Solution | Passive | Passive / Active | Passive | Passive / Active | Passive |

The Tesla M60, like the K10 and K80, places two GPUs on a single PCIe card. Your CUDA applications must be designed to work with multiple GPUs to leverage all of the resources on these cards, as in the sketch below.
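
As a rough illustration of what that entails, this minimal sketch enumerates the devices the CUDA runtime exposes (a dual-GPU card such as the M60 or K80 appears as two devices) and launches independent work on each. The kernel, array size, and launch configuration are placeholders chosen only for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: scales an array so each GPU has some work to do.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);   // a dual-GPU card reports two devices
    printf("Found %d CUDA device(s)\n", deviceCount);

    const int n = 1 << 20;
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);             // subsequent runtime calls target this GPU
        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d_data);
    }
    return 0;
}
```

In a real application you would partition the workload across the devices, typically driving each GPU from its own host thread or CUDA stream.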

If your applications rely predominantly on single precision floating-point arithmetic, keep in mind that the Maxwell architecture is significantly more efficient than Kepler. For example, our benchmarks of the single precision dense matrix-matrix multiplication routine in cuBLAS (SGEMM) sustain ~87% of peak throughput on Maxwell, compared to ~75% on Kepler.
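
The sketch below shows one way such a measurement can be made with cuBLAS and CUDA events; it is not the benchmark harness behind the figures above, the matrix size is an arbitrary choice, and the matrices are left uninitialized because only timing is of interest here.

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 4096;                      // arbitrary square-matrix size
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm-up call so the timed run excludes one-time initialization costs.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    // Time a single SGEMM (C = alpha*A*B + beta*C) with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * n * n * n / (ms * 1e-3) / 1e12;  // ~2*N^3 flops per GEMM
    printf("SGEMM %dx%d: %.2f ms, %.2f TFLOP/s\n", n, n, ms, tflops);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```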

The Maxwell architecture is not optimized for double precision arithmetic.  You can expect the K40 and K80 to outperform the new M40 and M60 for workloads that require high throughput double precision arithmetic.

The new Tesla GPUs support Compute Capability 5.2, which provides Dynamic Parallelism and Hyper-Q, just as the K40 and K80 do. They also feature 96 KB of shared memory per streaming multiprocessor, double the amount available on Compute Capability 3.0 and 3.5 devices.
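
These limits can be confirmed at runtime by querying the device properties. The short sketch below prints the fields relevant to this discussion; the bandwidth figure it derives is the theoretical peak computed from the memory clock and bus width, not a measured value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
        printf("  Shared memory per SM:    %zu KB\n",
               prop.sharedMemPerMultiprocessor / 1024);
        printf("  Registers per SM:        %d (32-bit)\n",
               prop.regsPerMultiprocessor);
        // Peak bandwidth: 2 (DDR) * memory clock (kHz) * bus width (bytes)
        printf("  Peak memory bandwidth:   %.0f GB/s\n",
               2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
    }
    return 0;
}
```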

The Tesla K80 is still a compelling offering. In addition to unmatched double precision performance, it provides the highest global memory bandwidth of the GPUs listed here. The K80 also provides the most shared memory per streaming multiprocessor, as well as twice the shared memory bandwidth per streaming multiprocessor, because Kepler's shared memory banks are twice as wide (8 bytes) as those of Fermi and Maxwell (4 bytes). The K80 also provides twice as many registers per streaming multiprocessor, which improves efficiency in some applications.
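
The wider banks are exposed through the CUDA runtime: on Kepler, shared memory can be configured for 8-byte banks, whereas Fermi and Maxwell banks are fixed at 4 bytes, so the request below is effectively a no-op on those architectures. This is only a minimal sketch of setting and reporting the bank size; whether the 8-byte mode actually helps depends on your kernels' access patterns.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Request 8-byte shared memory banks. This takes effect on Kepler
    // (e.g. Tesla K40/K80); Maxwell banks remain 4 bytes wide.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    // Read back the configuration the device is actually using.
    cudaSharedMemConfig config;
    cudaDeviceGetSharedMemConfig(&config);
    printf("Shared memory bank size: %s\n",
           config == cudaSharedMemBankSizeEightByte ? "8 bytes" : "4 bytes");
    return 0;
}
```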

The Tesla M40 comes with a passive cooling solution, so it can only be installed in servers and workstations specifically designed to provide airflow to cool the GPU. The Tesla M60 comes in active and passive variants. The actively cooled variant of the M60 can be installed in any server or workstation with sufficient space and power.