
Datadog just dropped a new tool designed to stop AI projects from bleeding cash on expensive GPUs. The new GPU Monitoring feature gives teams a single view of their fleet's health, costs, and performance, solving a major headache for companies scaling up their AI workloads. For many organisations, GPU and token costs are getting harder to control as they push for greater AI use, which risks wiping out any benefits they may gain from AI. When AI performance slows, IT teams add more resources. If they can't see why it has slowed, they are spending money for no reason.

Yanbing Li, Chief Product Officer at Datadog, said, “GPU instances account for 14 percent of compute costs, which is a huge issue as companies are struggling to build AI-first technology in scalable and smart ways.
“While these companies can see their costs climbing, they can’t chargeback GPU spend across business units, see workload context or identify clear next steps for improvement. As a result, it is very challenging to budget and plan in thoughtful ways.”
Same old, same old
This is just a repetition of what happens whenever we change technology stacks. When we moved to the cloud, SaaS costs and licences exploded, and storage costs went through the roof. But organisations were told not to worry: the shift from CAPEX to OPEX would save them money. For many it has, but only over time. Early on, those OPEX budgets were spiralling out of control.
Look at what happened with SQL. Poorly designed queries that ate resources led to the development of query optimisers. When databases moved to the cloud, where costs were calculated on input and output operations, bills became insane. It took time to locate and optimise that cost out, although many would say it is still there.
To solve this, there needs to be a shift in how we monitor. Rather than relying on basic health metrics, AI requires a view of where resource contention is occurring. Where is the workload resource-bound? Why is it resource-bound? What is making the AI prompts and usage inefficient? A rough sketch of this kind of check appears below.
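As an illustration of the questions above, here is a minimal sketch that classifies each GPU in a host as compute-bound, memory-bound, or idle. It uses NVIDIA's NVML Python bindings (pynvml, installed via the nvidia-ml-py package), not Datadog's agent, and the thresholds are arbitrary assumptions for the example.

```python
# Minimal illustration: is each GPU compute-bound, memory-bound, or idle?
# Assumes NVIDIA hardware and the pynvml bindings (pip install nvidia-ml-py).
# Thresholds are arbitrary examples, not Datadog's logic.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percentages
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
        mem_pct = 100 * mem.used / mem.total

        if util.gpu < 10 and mem_pct > 80:
            state = "memory-bound or stalled: VRAM full but cores idle"
        elif util.gpu > 90:
            state = "compute-bound: consider rescheduling lower-priority jobs"
        elif util.gpu < 10:
            state = "idle: candidate for reclaiming"
        else:
            state = "healthy"
        print(f"GPU {i}: util={util.gpu}% mem={mem_pct:.0f}% -> {state}")
finally:
    pynvml.nvmlShutdown()
```

A single snapshot like this only hints at where contention lies; a monitoring platform samples these counters continuously and correlates them with scheduler and application metadata.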
The size of the budget given over to AI is creating a boardroom conversation around cost. Again, we’ve seen this before with cybersecurity. Both currently look like money pits to the C-Suite as they wait for the benefits to appear.
Li continued, “Smartly managing AI spend becomes a board-level conversation when capacity is misallocated, training and inference workloads stall, and costs escalate.
“We all know managing GPU costs is a huge problem we need to solve, but most companies are experimenting with solutions, and it is still very difficult to get a single view of what is happening across the stack. GPU Monitoring fixes that with efficiency and reliability that we haven’t seen before.”
How will GPU monitoring change this?
The platform links the telemetry from each GPU to the workloads running on it. That allows operations teams to see what is consuming resources and how much, and then decide what needs to change. It could be rescheduling jobs, or it could be a sign of a failing GPU. The sketch below shows the basic idea of tying GPU telemetry back to processes.
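To make the telemetry-to-workload link concrete, here is a hypothetical sketch that attributes GPU memory to the processes holding it, again using pynvml. It illustrates the principle, not Datadog's implementation.

```python
# Hypothetical sketch: attribute each GPU's memory to the processes using it,
# so usage can be tied back to a job or team. Assumes NVIDIA hardware, Linux,
# and the pynvml bindings (pip install nvidia-ml-py).
import pynvml

def process_name(pid: int) -> str:
    # Linux-only: resolve a PID to its command name via /proc.
    try:
        with open(f"/proc/{pid}/comm") as f:
            return f.read().strip()
    except OSError:
        return "unknown"

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            mib = (proc.usedGpuMemory or 0) / 1024**2  # may be None on some platforms
            print(f"GPU {i}: pid={proc.pid} ({process_name(proc.pid)}) "
                  f"using {mib:.0f} MiB")
finally:
    pynvml.nvmlShutdown()
```

From a process ID it is a short step to a pod, a job, and ultimately a team or budget line, which is what makes the chargeback Li describes possible.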
Datadog calls out four benefits for customers:
- Scale AI without overspending: With visibility and forecasting based on the usage patterns of fleets and direct guidance on whether to buy new GPUs or free up existing ones, platform teams avoid expensive purchases and long procurement cycles, machine learning teams get capacity faster, and leadership gets better ROI with predictable spend.
- Accelerate AI delivery: Stalled workloads are correlated directly to the underlying GPUs, pods and processes running them so that teams can troubleshoot performance bottlenecks in minutes instead of hours, allowing engineers to focus on shipping AI projects.
- Avoid costly disruptions: Unhealthy GPUs are proactively identified before failures cascade across a cluster and cause training and inference delays.
- Maximise ROI on GPU spend: Teams are empowered and accountable for their GPU utilisation and costs, and can easily pinpoint where they are over-reserving or underutilising their GPUs. This allows teams to reclaim and reallocate resources in order to reduce wasted spend.
Enterprise Times: What does this mean?
The increased use of AI means that our IT operations stacks need to mature. Just throwing raw resources at a problem is not a solution. Organisations need to show an ROI, and that means knowing all the costs incurred and being able to identify inefficiencies.
This is what Datadog claims it is delivering. It is already seeing a positive response from customers, who are able to tie latency alerts to GPU metrics. That means they are reducing the cost of AI, which, in turn, will lead to higher usage and better ROI.
Datadog’s GPU Monitoring plugs a huge gap in AI infrastructure by giving teams visibility. Will we now see GPU management become a differentiator between businesses as they look for AI gains? We will only know how effective that is when Datadog publishes some customer white papers.
