Tesla Meets Maxwell
Tesla M40 and Tesla M60 - A New Epoch for GPU Computing
One scientific epoch ended and another began with James Clerk Maxwell
NVIDIA has recently launched new Tesla GPUs based on the Maxwell architecture.
Table 1 lists key specifications for the new Maxwell-based Tesla M40 and M60 compared to the Kepler-based Tesla K10, K40 and K80.
Table 1 - Key Specifications of NVIDIA Tesla GPUs
|Specification||Tesla K10||Tesla K40||Tesla M40||Tesla M60||Tesla K80|
|Cores||3072 (2x1536)||2880||3072||4096 (2x2048)||4992 (2x2496)|
|Memory (GB)||8 (2x4)||12||12||16 (2x8)||24 (2x12)|
|Memory Bandwidth (GB/s)||320 (2x160)||288||288||320 (2x160)||480 (2x240)|
|Peak Single Precision (TFlops)||4.580||5.040||6.84||9.7||8.74|
|Peak Double Precision (TFlops)||0.190||1.680||0.213||0.3||2.91|
|Double Precision:Single Precision Throughput Ratio||1:24||1:3||1:32||1:32||1:3|
|Maximum Shared Memory / Streaming Multiprocessor (KB)||48||48||96||96||112|
|Registers / Streaming Multiprocessor (KB)||256||256||256||256||512|
|Cooling Solution||Passive||Passive / Active||Passive||Passive / Active||Passive|
The Tesla M60 GPU card, like the K10 and K80, features two GPUs on a single PCIe card. Each GPU appears to CUDA as a separate device, so your applications have to be designed to work with multiple GPUs to use all the resources on these cards.
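As a rough illustration of what designing for multiple GPUs involves, the sketch below splits a range of work items into one contiguous chunk per device. It is plain Python rather than CUDA API code, and `split_work` is a hypothetical helper introduced here for illustration, not something from the CUDA toolkit.

```python
def split_work(n_items, n_devices):
    """Return one (start, stop) index range per device, as evenly as possible."""
    base, extra = divmod(n_items, n_devices)
    ranges = []
    start = 0
    for dev in range(n_devices):
        # The first `extra` devices each take one additional item.
        stop = start + base + (1 if dev < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


# A dual-GPU card such as the Tesla M60 would need two chunks.
for dev, (start, stop) in enumerate(split_work(1_000_001, 2)):
    print(f"device {dev}: items [{start}, {stop})")
```

In a real application each chunk would be dispatched to its own device and the partial results combined afterwards; how that is done depends on the workload.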
If your applications predominantly use single precision floating-point arithmetic, keep in mind that the Maxwell architecture is significantly more efficient than Kepler. For example, our benchmarks of single precision dense matrix-matrix multiplication routines in cuBLAS (SGEMM) sustain ~87% of peak throughput on Maxwell, compared to ~75% on Kepler.
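To put those percentages in absolute terms, the following sketch multiplies the peak single precision figures from Table 1 by the sustained SGEMM fractions quoted above. It assumes the per-architecture efficiency applies to the K40 and M40 specifically, which the benchmark description does not state, so treat the results as estimates.

```python
# Peak single precision throughput (TFlops), taken from Table 1.
PEAK_SP_TFLOPS = {"Tesla K40": 5.040, "Tesla M40": 6.84}

# Sustained fraction of peak for SGEMM, as quoted in the text.
# Assumption: the per-architecture figure holds for these specific cards.
SGEMM_EFFICIENCY = {"Tesla K40": 0.75, "Tesla M40": 0.87}


def sustained_tflops(card):
    """Estimate sustained SGEMM throughput as peak times efficiency."""
    return PEAK_SP_TFLOPS[card] * SGEMM_EFFICIENCY[card]


for card in PEAK_SP_TFLOPS:
    print(f"{card}: ~{sustained_tflops(card):.2f} TFlops sustained SGEMM")
```

Under these assumptions the M40's advantage in sustained SGEMM throughput is larger than the gap in peak numbers alone suggests.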
The Maxwell architecture is not optimized for double precision arithmetic. You can expect the K40 and K80 to outperform the new M40 and M60 for workloads that require high throughput double precision arithmetic.
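The size of that gap can be read straight from Table 1. The sketch below compares peak double precision throughput only; `dp_advantage` is a hypothetical helper name, and real applications will see different ratios depending on how much of their time is spent in double precision arithmetic.

```python
# Peak double precision throughput (TFlops), taken from Table 1.
PEAK_DP_TFLOPS = {
    "Tesla K40": 1.680,
    "Tesla K80": 2.91,
    "Tesla M40": 0.213,
    "Tesla M60": 0.3,
}


def dp_advantage(kepler_card, maxwell_card):
    """Ratio of peak double precision throughput, Kepler over Maxwell."""
    return PEAK_DP_TFLOPS[kepler_card] / PEAK_DP_TFLOPS[maxwell_card]


print(f"K40 vs M40: {dp_advantage('Tesla K40', 'Tesla M40'):.1f}x")
print(f"K80 vs M60: {dp_advantage('Tesla K80', 'Tesla M60'):.1f}x")
```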
The new Tesla GPUs support Compute Capability 5.2, which provides Dynamic Parallelism and Hyper-Q, as on the K40 and K80. They also feature 96 KB of shared memory per streaming multiprocessor, twice the maximum available on Compute Capability 3.0 and 3.5 devices.
The Tesla K80 is still a compelling offering. In addition to unmatched double precision performance, it provides the highest global memory bandwidth of the cards in Table 1. The K80 also provides the most shared memory per multiprocessor, as well as twice the shared memory bandwidth per streaming multiprocessor, because Kepler's shared memory banks are twice as wide as those of Fermi and Maxwell. Finally, the K80 provides twice as many registers per multiprocessor, which improves efficiency in some applications.
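The shared memory and register claims can be worked through numerically. The sketch below assumes 32 shared memory banks per multiprocessor on both architectures, bank widths of 8 bytes on Kepler and 4 bytes on Maxwell, and 4-byte registers; none of these values are stated in this article, so they are labeled as assumptions in the code. Register file sizes come from Table 1.

```python
# Assumption (not stated in the article): 32 banks per multiprocessor.
BANKS_PER_SM = 32
# Assumption: bank width in bytes; Kepler banks are twice as wide as Maxwell's.
BANK_WIDTH_BYTES = {"Kepler": 8, "Maxwell": 4}

# Register file size per multiprocessor (KB), taken from Table 1.
REGISTER_FILE_KB = {"Tesla M40": 256, "Tesla K80": 512}


def shared_bytes_per_clock(arch):
    """Peak shared memory bytes served per clock per multiprocessor."""
    return BANKS_PER_SM * BANK_WIDTH_BYTES[arch]


def registers_per_sm(card, bytes_per_register=4):
    """Number of registers per multiprocessor, assuming 4-byte registers."""
    return REGISTER_FILE_KB[card] * 1024 // bytes_per_register


for arch in BANK_WIDTH_BYTES:
    print(f"{arch}: {shared_bytes_per_clock(arch)} bytes/clock per multiprocessor")
for card in REGISTER_FILE_KB:
    print(f"{card}: {registers_per_sm(card)} registers per multiprocessor")
```

These are per-clock figures, so actual bandwidth also depends on each card's clock rate.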
The Tesla M40 comes with a passive cooling solution, so it can only be installed in servers and workstations specifically designed to provide airflow to cool the GPU. The Tesla M60 comes in active and passive variants. The actively cooled variant of the M60 can be installed in any server or workstation with sufficient space and power.