Thinking Machines Lab (TML) has significantly expanded its infrastructure agreement with Google Cloud, becoming one of the first customers to use the NVIDIA Blackwell-based A4X Max virtual machines. The new deal centers on the deployment of NVIDIA GB300 NVL72 systems within Google’s AI Hypercomputer architecture. In early benchmarking, TML reported a 2x increase in training and serving speeds compared to previous-generation hardware, a leap critical for the lab’s increasingly complex reinforcement learning workloads.
The partnership moves beyond raw compute, leveraging Google’s integrated AI stack to solve persistent data bottlenecks. TML uses the Jupiter network for near-instantaneous weight transfers, Google Kubernetes Engine for massive-scale orchestration, and Spanner for managing transactional metadata. By combining these with a custom node-level caching solution, TML can maintain continuous model training while simultaneously serving production-grade workloads at global scale.
The expanded capacity is specifically designed to support the development of TML’s frontier models and its fine-tuning product, Tinker. Myle Ott, Founding Researcher at TML, noted that the automated remediation provided by Google’s Cluster Director and the reliability of the integrated stack allow the team to focus on high-level research rather than infrastructure maintenance.