OpenAI Explores TPUs to Cut Rising AI Inference Costs

In today’s rapidly evolving AI landscape, the costs associated with running large language models (LLMs) have become a focal point for industry leaders. Chloe Maraina, our Business Intelligence expert, sheds light on how OpenAI is addressing these challenges while exploring new avenues to maintain efficiency. With her robust background in data science and integration, Chloe provides insights into the strategic decisions fueling OpenAI’s growth and sustainability.

Can you elaborate on the reasons why OpenAI started testing Google’s TPUs?

OpenAI’s decision to test Google’s Tensor Processing Units (TPUs) is largely driven by the need to address rising inference costs. While the tests do not indicate an immediate transition, they highlight a proactive approach to exploring how these specialized processors might offer economic efficiency. This move also signifies OpenAI’s willingness to diversify its technological toolkit amidst rising operational expenses.

What challenges or concerns is OpenAI facing with the current inferencing costs?

Inference costs have become a significant burden, consuming over half of OpenAI’s compute budget. The high demand and limited availability of Nvidia GPUs exacerbate these financial strains. Thus, tackling these costs is not only about affordability but also about securing enough computational power to continue scaling efficiently.

How does OpenAI plan to manage the growing computational demands of running large language models (LLMs)?

To manage these demands, OpenAI is exploring alternative pathways beyond Nvidia GPUs, such as testing TPUs and considering custom chips like ASICs. These avenues offer possibilities for reducing costs while maintaining or enhancing the performance of their LLMs. By leveraging diverse computing resources, OpenAI aims to optimize operational efficiency and scalability.

Why is there a shift from Nvidia GPUs to alternative AI chips for inference workloads?

The primary driver behind this shift is cost. Inference operations have become expensive as industry spending moves from training-focused computing toward serving models in production. Alternatives like TPUs are being explored because they can potentially reduce these costs and handle inference workloads more efficiently, delivering a better cost-per-query metric.

Could you explain the cost benefits of using TPUs over Nvidia GPUs for OpenAI?

TPUs, especially the older models, offer a significantly lower cost-per-inference than Nvidia GPUs. Even though they may not match the peak performance of the latest Nvidia chips, the architecture of TPUs is optimized to minimize energy waste and idle resources, which makes them a cost-effective solution at scale.
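To make that kind of cost-per-inference comparison concrete, here is a minimal back-of-the-envelope sketch that converts an accelerator’s hourly price and sustained serving throughput into a cost per million tokens. All prices and throughput figures are hypothetical placeholders, not published Nvidia or Google pricing.

```python
# Back-of-the-envelope cost-per-inference comparison.
# All prices and throughput numbers are hypothetical placeholders,
# not actual Nvidia GPU or Google TPU figures.

def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly accelerator price and sustained serving throughput
    into a cost per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Hypothetical accelerators: (hourly on-demand price, sustained tokens/s while serving)
accelerators = {
    "latest_gpu": (4.00, 1800.0),   # faster peak, but priced at a premium
    "older_tpu":  (1.20, 900.0),    # slower, yet much cheaper per hour
}

for name, (price, throughput) in accelerators.items():
    print(f"{name}: ${cost_per_million_tokens(price, throughput):.2f} per 1M tokens")
```

In this toy example the slower accelerator still wins on cost per token, because its hourly price falls faster than its throughput does, which is exactly the dynamic described above.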

How might OpenAI’s testing of TPUs impact the efficiency of its model operations?

Testing TPUs allows OpenAI to evaluate their potential in fine-tuning model operations. The enhanced energy efficiency and operational cost metrics associated with TPUs could lead to more sustainable model deployments. This exploration may also point to opportunities for optimizing workload distribution across various processor types.

What is the forecast for chip-related capital expenditure in the AI inference sector?

According to projections, capital expenditure in the AI inference sector could reach $120 billion by 2026 and rise beyond $1.1 trillion by 2028. Such a steep curve underlines the growing financial scale of AI infrastructure development, particularly as more companies pivot towards inference-dedicated hardware options to control costs.

Why are LLM providers, including OpenAI, considering custom chips like ASICs?

Custom chips such as ASICs are being considered for their ability to substantially cut inference-related expenses. These chips are tailored for specific tasks, offering efficiencies that generic processors cannot provide. By reducing the cost of running models, LLM providers can increase profitability while maintaining high-performance outputs.

How much of OpenAI’s compute budget is consumed by inference, and why is this significant?

Inference consumes over 50% of OpenAI’s compute budget. This high percentage underscores the necessity for optimizing wherever possible. Lowering inference costs is crucial for sustaining operational viability and enabling OpenAI to allocate more resources towards innovation and model complexity enhancement.

In what ways do older TPUs offer cost-effective solutions for OpenAI at scale?

Older TPUs provide a cost advantage due to their energy-efficient architecture. Although they might not deliver the headline performance of the latest GPUs, their ability to execute at a lower power cost makes them a practical option for large-scale operations. This efficiency is particularly useful when scalability hinges on managing operational costs effectively.
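The same logic applies on the energy side. The sketch below estimates the electricity cost attributable to one million generated tokens from a chip’s power draw and throughput; the wattages, throughput values, and electricity rate are illustrative assumptions only, not measured figures for any specific chip.

```python
# Rough energy-cost comparison per million generated tokens at scale.
# Power draw, throughput, and electricity price are illustrative assumptions.

ELECTRICITY_USD_PER_KWH = 0.10  # assumed datacenter electricity rate

def energy_cost_per_million_tokens(power_watts: float, tokens_per_second: float) -> float:
    """Electricity cost attributable to generating one million tokens."""
    hours_per_million = 1_000_000 / tokens_per_second / 3600
    kwh = power_watts / 1000 * hours_per_million
    return kwh * ELECTRICITY_USD_PER_KWH

# Hypothetical: an older accelerator drawing less power at lower throughput
# versus a newer one drawing more power at higher throughput.
print(energy_cost_per_million_tokens(power_watts=250, tokens_per_second=900))   # older chip
print(energy_cost_per_million_tokens(power_watts=700, tokens_per_second=1800))  # newer chip
```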

Can you describe the comparative performance benefits of Google’s TPUs as stated by the analysts?

Analysts like Alexander Harrowell highlight that TPUs often sustain a higher fraction of their theoretical peak floating-point throughput than alternatives. This means they can perform computational tasks closer to their full potential, offering considerable advantages in terms of performance efficiency.
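One way to express that ratio is as a utilization figure: achieved floating-point throughput divided by the chip’s theoretical peak. The numbers in the sketch below are invented purely to show the calculation, not benchmark results for any real TPU or GPU.

```python
# Utilization: achieved FLOP/s as a fraction of the chip's theoretical peak.
# The peak and achieved figures are invented for illustration only.

def utilization(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of theoretical peak throughput actually sustained."""
    return achieved_tflops / peak_tflops

# Hypothetical chips: one with a lower peak but better sustained efficiency.
chip_a = utilization(achieved_tflops=180.0, peak_tflops=300.0)  # 60% of peak
chip_b = utilization(achieved_tflops=280.0, peak_tflops=700.0)  # 40% of peak

print(f"chip_a sustains {chip_a:.0%} of peak, chip_b sustains {chip_b:.0%}")
```

The chip with the lower headline peak can come out ahead on delivered work per dollar if it spends less of its time idle or stalled, which is the efficiency argument made above.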

What insights can you provide about the lifespan and market longevity of AI chips?

AI chips generally have a longer market presence than anticipated. Despite rapid technological advancements, older models like Google’s TPU v2 or Nvidia’s A100 series are still viable and continue to be sold. Their enduring presence highlights their adequate performance and cost benefits for a range of applications.

What generations of TPUs are currently available for purchase through Google Cloud Platform?

Google Cloud Platform offers TPU generations from v2 through v6, the latest of which is branded Trillium. Each generation presents varying specifications tailored for different performance and efficiency needs. Within these, v5 offers sub-variants focused on optimizing either performance or efficiency, giving users a flexible range of resources to choose from.
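For readers curious what targeting one of these TPU generations looks like in practice, the short JAX sketch below simply lists the attached TPU cores and runs a small matrix multiply on them. It assumes it is executed on a Google Cloud TPU VM with JAX’s TPU support (the `jax[tpu]` install) available.

```python
# Minimal JAX check on a Google Cloud TPU VM: list the attached TPU cores
# and run a small matrix multiply on them. Assumes jax[tpu] is installed.
import jax
import jax.numpy as jnp

devices = jax.devices()   # on a TPU VM these are TPU devices
print(devices)

x = jnp.ones((1024, 1024))
y = jnp.dot(x, x)         # dispatched to the default TPU device
print(y.shape, y.dtype)
```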

What information do you have about Google’s plans for the Ironwood TPU?

Google’s preview of the Ironwood TPU in April suggested it has superior cost-performance capabilities compared to its predecessors and competing products from Nvidia, AMD, and others. However, widespread availability remains limited, indicating that these might still be in the early stages of deployment.

How does diversifying chip suppliers help OpenAI avoid potential bottlenecks?

Diversifying chip suppliers enables OpenAI to minimize risks related to supply shortages, like those experienced with GPUs. This strategy provides greater leverage in pricing negotiations and ensures that computational demands can consistently be met, allowing OpenAI to sustain its growth and operation uninterrupted.

Who are OpenAI’s current chip suppliers, and are there any others being considered?

OpenAI currently relies on Microsoft, Oracle, and CoreWeave for its compute supply. Additionally, it is considering custom silicon solutions like AWS’s Trainium and Microsoft’s Maia. These options could potentially enhance performance and adaptability to various workload demands while optimizing cost efficiency.

How might custom silicon options like AWS Trainium and Microsoft Maia benefit OpenAI?

These custom solutions offer several potential benefits, including tailoring to the specifics of AI acceleration and inference workloads. By adopting such technologies, OpenAI can harness more control over performance specifications and cost, aligning them closely with their operational goals and infrastructure needs.

What might be the focus of a potential special deal between OpenAI and Google regarding TPUs?

There’s a possibility that OpenAI might negotiate with Google to use TPUs for more internal functions, such as model testing and employee training. This arrangement could provide OpenAI with an innovative edge without committing large-scale deployment resources to untested technologies.

What is your forecast for chip-related capital expenditure in the AI inference sector?

I foresee further increasing investments in AI chips, driven by demand for more specialized hardware and the ongoing focus on reducing operational costs. As the AI field continues to scale and diversify, strategic investments in infrastructure will be critical to maintain competitive advantage and technological leadership.
