With next-gen chips pushing 700W, thermal management is the hot topic
SC22 It’s safe to say liquid cooling was a hot topic at the Supercomputing conference in Dallas this week.
As far as the eye could see, the exhibition hall was packed with liquid-cooled servers, oil-filled immersion cooling tanks, and all the fittings, pumps, and coolant distribution units (CDUs) you might need to deploy the tech in a datacenter.
Given that this is a conference all about high-performance computing, the emphasis on thermal management shouldn’t really come as a surprise. But with 400W CPUs and 700W GPUs now in the wild, it’s hardly an HPC- or AI-exclusive problem. As more enterprises look to add AI/ML-capable systems to their datacenters, 3kW, 5kW, or even 10kW systems aren’t that crazy anymore.
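To put those numbers in context, here’s a rough back-of-the-envelope sketch – ours, not anything a vendor quoted us – of how much coolant it takes to haul that kind of heat away, assuming plain water and a 10°C temperature rise across the loop:

```python
# Back-of-the-envelope: coolant flow needed to carry away a given heat load.
# Assumes plain water (specific heat ~4186 J/kg/K) and a 10 K rise across the
# loop -- illustrative numbers, not any vendor's spec.

SPECIFIC_HEAT_WATER = 4186  # J/(kg*K)
DENSITY_WATER = 0.998       # kg per liter at roughly room temperature

def flow_rate_lpm(heat_load_w: float, delta_t_k: float = 10.0) -> float:
    """Liters per minute of water needed to absorb heat_load_w at a delta_t_k rise."""
    kg_per_s = heat_load_w / (SPECIFIC_HEAT_WATER * delta_t_k)
    return kg_per_s / DENSITY_WATER * 60

for load in (3_000, 5_000, 10_000):  # the 3kW, 5kW, and 10kW systems mentioned above
    print(f"{load / 1000:.0f} kW -> ~{flow_rate_lpm(load):.1f} L/min of water")
```

That works out to somewhere around 4 to 15 liters of water a minute. Doing the same job with air at the same temperature rise would take on the order of 1,700 cubic feet of it per minute for the 10kW case, which goes a long way toward explaining the appeal of liquid.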
So here’s a breakdown of the liquid-cooling kit that caught our eye at this year’s show.
The vast majority of the liquid-cooling systems being shown off at SC22 are of the direct-liquid variety. These swap copper or aluminum heat sinks and fans for cold plates, rubber tubing, and fittings.
If we’re being honest, these cold plates all look more or less the same. They’re essentially just hollowed-out blocks of metal with an inlet and an outlet for fluid to pass through. Note that we’re using the word “fluid” here because liquid-cooled systems can use any number of coolants that aren’t necessarily water.
In many cases, OEMs are sourcing their cold plates from the same vendors. For instance, CoolIT provides liquid-cooling hardware for several OEMs, including HPE and Supermicro.
However, that’s not to say there isn’t an opportunity for differentiation. The insides of these cold plates are filled with micro-fin arrays that can be tweaked to optimize the flow of fluid through them. Depending on how large the dies are, or how many of them there are to cool, the internals of these cold plates can vary quite a bit.
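The reason those fins matter comes down to textbook convection – a simplification on our part, not anything off a vendor’s datasheet:

```latex
Q = h \, A \, \Delta T
```

Here Q is the heat carried away, h the convective heat-transfer coefficient, A the wetted surface area, and ΔT the temperature difference between die and coolant. Packing micro-fins into the cavity multiplies A without growing the plate’s footprint, while the channel geometry keeps h up by keeping the coolant moving quickly over the fins.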
Most of the liquid-cooled systems we saw on the show floor were using some kind of rubber tubing to connect the cold plates. This means the liquid is only cooling specific components like the CPU and GPU. So while the bulk of the fans can be removed, some airflow is still required.
Lenovo’s Neptune and HPE Cray’s EX blades were the exception to this rule. These systems are purpose-built for liquid cooling and are packed to the gills with copper tubing, distribution blocks, and cold plates for everything, including the CPUs, GPUs, memory, and NICs.
Using this approach, HPE has managed to cram eight of AMD’s 400W fourth-gen Epyc (Genoa) CPUs into a single 19-inch chassis.
Meanwhile, Lenovo showed off a 1U Neptune system designed to cool a pair of 96-core Epycs and four of Nvidia’s H100 SXM GPUs. Going by the 400W and 700W figures above, that works out to roughly 3.6kW of silicon in a single rack unit, before you count memory, NICs, or storage.