Professor Paris Smaragdis often needs neural networks for his audio-related research. The AI systems allow him, for example, to extract specific sounds from a noisy jumble—a feature that came in handy when extracting conversations for the 2021 Beatles documentary series Get Back.
The neural networks that process his data sets require fast GPUs: specialized processors, often housed in on-campus data racks.
But at the Grainger College of Engineering at the University of Illinois Urbana-Champaign, he’s not the only professor who needs the AI hardware.
“We have people doing a lot of computer vision, a lot of language processing, and these are folks that need to train really big models and really big data sets,” Smaragdis said.
Smaragdis said he has seen “exponential” growth in the use of GPUs among students and staff, especially his colleagues studying computer vision, language processing, and of course, AI.
“What’s the best way to serve a lot of people using a GPU cluster? We’re trying to figure those things out. And I think a lot of that responsibility falls on the IT folks,” he said.
Campus pros like Associate Computational Systems Analyst Kaiwen Xue ensure that Grainger's roughly 2,000 research systems are equipped with the most helpful tools. Lately, that has meant setting staff up with Nvidia GPUs.
“Every time a researcher steps away from doing their research and has to focus their time on doing IT or administrative tasks, such as updating the system or updating their GPU drivers, that’s taking away work time,” Xue said.
Drivers wanted. A software program known as a driver allows a computer’s operating system and applications to communicate effectively with an installed graphics card or GPU.
There are a variety of GPUs on the U of I campus, each with its own drivers. Xue estimates Grainger has over 900 GPUs across over 60 unique models, ranging from the basic Nvidia Quadro P1000, to the more common Nvidia GeForce GTX 1080, to the high-end Nvidia H100.
Xue deploys scripts, triggered during system bootup, that download and install Nvidia driver updates on managed machines.
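The article doesn't share Xue's actual scripts, but a minimal sketch of the idea, assuming a Debian/Ubuntu fleet running systemd with Nvidia's packaged drivers, could be a boot-time oneshot unit (the unit name and driver package below are placeholders):

```ini
# Hypothetical /etc/systemd/system/nvidia-driver-update.service:
# runs once per boot, pulling the latest packaged Nvidia driver.
[Unit]
Description=Update Nvidia GPU driver at boot
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/bin/apt-get update -q
# Upgrade only the driver package; other software is left untouched.
ExecStart=/usr/bin/apt-get install -y --only-upgrade nvidia-driver-550

[Install]
WantedBy=multi-user.target
```

Enabled once with `systemctl enable nvidia-driver-update.service`, a unit like this keeps drivers patched without a researcher ever touching the machine.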
“What we’re trying to do is create an environment where the drivers can be updated, can be patched with security updates, while not influencing or impacting the researchers’ work,” Xue said.
Fully booked. Even the students toying with new large language models are a bit like gases lately, Smaragdis explained with a laugh: give them lots of resources, and they fill the space. That ever-growing demand for GPUs leads to another difficult challenge: scheduling.
“We want to be able to give access to as many people as possible, and we don’t want to have a small percentage of people that basically hog the system,” Smaragdis said.
Xue uses the HTCondor Software Suite, a job scheduler developed at the University of Wisconsin–Madison that automates and manages workloads and computing resources; researchers claim cluster-node resources through the tool.
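HTCondor jobs are described in a plain-text submit file; a minimal, hypothetical example that claims one GPU on a cluster node (the file names and resource figures are placeholders) looks like:

```text
# job.sub — hypothetical HTCondor submit description requesting one GPU
universe       = vanilla
executable     = train.sh
request_gpus   = 1
request_cpus   = 4
request_memory = 16GB
log            = train.log
output         = train.out
error          = train.err
queue
```

A researcher submits it with `condor_submit job.sub` and monitors it with `condor_q`; HTCondor matches the request against available GPU nodes and schedules the job.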
But just like with conference rooms: Sometimes you don’t need to book the big one for your solo phone call. To address the challenge of enabling research while creating fair usage rules, Smaragdis said one effective idea so far has been to place short, experimental jobs on one GPU cluster and long-term, big-model jobs on another.
“We usually get together with our IT team and try to come up with a strategy that we think would make sense. Then, we get a lot of angry emails by the users, and then we try to adjust it. It’s an ongoing process,” Smaragdis told us.
Rack it up. The physical footprint of a GPU system is “measurably more impactful,” according to Xue. The traditional model of one physical, specialized workstation per researcher, he said, has given way to the rack-mounted data center. It’s getting hotter in the server room.
“Suddenly, we’re seeing a lot of empty lab spaces, a lot of empty graduate office spaces that aren’t running as many machines there, and a long queue of systems waiting to be installed into data centers,” Xue said.
To do GPU-enabled research today, all a researcher needs is a basic laptop to do a remote connection to a server—and some good IT pros managing workloads in the background.