Datadog has announced the general availability of GPU Monitoring, a product designed to help organizations manage the escalating costs and performance demands of AI infrastructure. The company reports that GPU instances now account for approximately 14% of total compute costs, a figure that frequently creates budgeting challenges due to a lack of granular visibility. The new solution provides a unified view across the AI stack, linking hardware health and power consumption directly to specific workloads and teams.
The platform addresses a gap in existing tools, which often provide device metrics without context regarding idle resources or workload failures. By correlating stalled training and inference tasks with underlying GPUs and pods, engineering teams can identify performance bottlenecks in minutes. “While these companies can see their costs climbing, they can’t chargeback GPU spend across business units,” stated Yanbing Li, Chief Product Officer at Datadog. The tool includes forecasting capabilities to help platform teams decide whether to procure new hardware or reallocate existing capacity.
Integration with Datadog’s LLM Observability allows users to trace model latency spikes back to hardware thermals or memory core utilization without switching platforms. Early adopters, such as Hyperbolic, report that the system enables per-device visibility into multi-tenant infrastructure “right out of the box.” The goal of the launch is to turn experimental AI development into production-ready systems by providing the auditability and accountability required to maximize ROI on expensive hardware investments.