
Crusoe is providing access to a private preview of two managed services, based on NVIDIA graphics processing units (GPUs), that automate the management of infrastructure for artificial intelligence (AI) workloads.
The company already provides access to GPUs on demand and is now expanding its reach to include managed services, says Nadav Eiron, senior vice president of Cloud Engineering for Crusoe.
Crusoe Managed Inference makes it possible to deploy machine learning models without requiring application developers and data science teams to configure infrastructure resources themselves. Instead, they can send a prompt to a large language model (LLM) by invoking an application programming interface (API) that Crusoe exposes.
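Crusoe has not published the interface details here, but many managed inference services expose an OpenAI-compatible endpoint. The sketch below assumes that pattern; the base URL, model name and environment variable are hypothetical placeholders rather than documented Crusoe values.

```python
# Minimal sketch of calling a managed inference endpoint.
# Assumption: an OpenAI-compatible chat-completions interface; the base URL,
# model name and environment variable below are hypothetical, not documented
# Crusoe values.
import os
import requests

API_BASE = "https://api.example-managed-inference.com/v1"  # hypothetical
API_KEY = os.environ["INFERENCE_API_KEY"]                  # hypothetical

resp = requests.post(
    f"{API_BASE}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "example-llm",  # hypothetical model identifier
        "messages": [{"role": "user", "content": "Summarize our Q3 incident reports."}],
        "max_tokens": 256,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The point of the abstraction is that the calling team never sees the GPUs behind that endpoint; capacity, scheduling and scaling are the provider's problem.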
The Crusoe AutoClusters managed service, meanwhile, provides higher degrees of fault tolerance using an orchestration capability built into the Crusoe cloud service. “It provides a higher level of abstraction,” says Eiron.
That orchestration capability is compatible with Slurm, Kubernetes and other tools and platforms, so IT teams can use an API, command-line interface (CLI) or graphical user interface (GUI) to automatically create GPU clusters that access NVIDIA Quantum-2 InfiniBand networking and a file system provided by VAST Data.
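The AutoClusters API is in private preview and its schema is not described in this article; the following is a generic illustration of the pattern implied above, a single authenticated request describing the desired cluster, with every endpoint, field name and value hypothetical.

```python
# Generic illustration of requesting a GPU cluster through a provisioning API.
# Every endpoint, field name and value here is hypothetical; the actual
# AutoClusters interface is in private preview and may differ entirely.
import os
import requests

API_BASE = "https://api.example-cloud.com/v1"   # hypothetical
TOKEN = os.environ["CLOUD_API_TOKEN"]           # hypothetical

cluster_spec = {
    "name": "training-cluster-01",
    "orchestrator": "slurm",       # e.g., slurm or kubernetes
    "node_count": 8,
    "gpus_per_node": 8,
    "interconnect": "infiniband",
    "shared_filesystem": "vast",
}

resp = requests.post(
    f"{API_BASE}/clusters",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
    timeout=30,
)
resp.raise_for_status()
print("Cluster request accepted:", resp.json().get("id"))
```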
Crusoe also provides intelligent error detection and automated troubleshooting, including node replacement and programmatic substitution with spare capacity, to minimize downtime.
Finally, Crusoe provides access to NVIDIA Data Center GPU Manager (DCGM) and additional proprietary tools to enable IT teams to monitor their environments.
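As a rough illustration of the DCGM side of that monitoring, the snippet below shells out to NVIDIA's dcgmi CLI to sample a couple of per-GPU metrics. It assumes DCGM is installed on the node, and the field IDs shown are illustrative; running `dcgmi dmon -l` will confirm which fields a given DCGM version exposes.

```python
# Sample per-GPU metrics by wrapping NVIDIA's dcgmi CLI (requires DCGM on the node).
# Field IDs are illustrative (150 = GPU temperature, 155 = power usage in the
# DCGM field catalog); verify with `dcgmi dmon -l` for your DCGM version.
import subprocess

result = subprocess.run(
    ["dcgmi", "dmon", "-e", "150,155", "-c", "5", "-d", "1000"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```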
It’s not clear how many organizations will opt to rely on managed services for AI infrastructure versus managing it themselves or relying on a third-party managed service provider (MSP). In some cases, a data science team that includes some IT infrastructure expertise might invoke a managed service. In other instances, it might be a centralized IT team.
The one thing that is certain is that the total cost of AI is steadily increasing, which inevitably will lead to an increased focus on optimizing consumption. In fact, there is already a nascent effort to extend the FinOps best practices used elsewhere to optimize cloud workloads to AI workloads. The simple fact is that organizations are not going to be able to fully fund every AI experiment, so ensuring infrastructure is consumed judiciously is now crucial.
Of course, there will also inevitably come a day when AI is being applied to achieve that goal. AI agents, for example, will be specifically trained to not only identify wasted resources but also, when directed, automatically optimize consumption of those resources.
In the meantime, getting the most out of scarce GPU resources is increasingly becoming the responsibility of IT operations teams. Data science teams will continue to focus on training AI models, but as those models are deployed for inference, the infrastructure management expertise of IT operations teams will need to be relied on more.
The challenge, as always, will be melding multiple cultures so that handoffs between those teams occur with as little friction as possible, ensuring AI applications are built, deployed and continuously updated as frequently as required.