AI for the CFO: variable cost, vendor sprawl, and the playbook that ends the surprise bill
Uber reportedly provisioned 5,000 engineers with Claude Code in December 2025. By April 2026 it had burned the entire annual AI budget in four months. Variable cost is the new shape of the AI line item, and the CFOs who are getting it under control are running the same four-lever playbook. Here is what is on each lever, which platforms are credible, and what to ship in the next ninety days.
In December 2025, Uber reportedly provisioned 5,000 engineers with Anthropic's Claude Code coding assistant. By April 2026, public reports surfaced that the company had burned through its entire annual AI budget in four months, driven by the recursive token consumption pattern of autonomous coding agents. Whether that exact figure holds or not, the shape of the problem is real. Variable cost has arrived inside the AI line item, and it has caught most finance functions with a per-seat budgeting mental model that no longer survives contact with how the tools get used in practice. This piece maps the four levers a CFO can pull, the platforms that sit on each, the pricing realities as of mid-2026, and a ninety-day playbook the most disciplined finance teams are running.
Why variable cost is the new shape of software P&L
Software was, until recently, the easiest line item on a CFO's P&L to forecast. You bought a SaaS subscription, you paid per seat, the bill barely moved. The marginal cost of one more employee using Salesforce was effectively zero. That assumption powered a decade of enterprise software budgeting.
That assumption is now wrong for any company doing serious AI work. The economist Wei Sun at SemiAnalysis puts the shift cleanly in a recent conversation with Val Bercovici: "software is going from minimal marginal costs to relatively high marginal costs". Every interaction with a large language model consumes compute that the provider has to be paid for, and the provider can charge nearly any price it wants depending on how long it makes the user wait and how much hardware it dedicates. The CFO inherits the bill.
Two facts about that bill are worth getting right before any platform conversation starts.
The first is that agentic workflows consume between five and thirty times more tokens than human chat workflows for the same task. A human asking a model a question typically uses around 2,000 tokens of inference. An autonomous coding agent doing the same task fetches code, runs tests, evaluates the output, rewrites, and iterates. The same task ships at 10,000 to 60,000 tokens once the agent is wired up. Most 2025 budget models for AI assumed the human-chat ratio. The bill that landed in Q1 2026 did not.
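A rough sketch of that arithmetic, useful for sanity-checking a budget model. The 2,000-token chat figure and the 5x to 30x multipliers come from the paragraph above; the task volume and the blended price per million tokens are assumptions chosen purely for illustration, not vendor quotes.

```python
# Planning inputs. The token figures come from the text above;
# the price and the task volume are assumptions, not vendor quotes.
CHAT_TOKENS_PER_TASK = 2_000        # human chat figure quoted above
AGENT_MULTIPLIERS = (5, 30)         # low and high end of the agentic range
PRICE_PER_MILLION_TOKENS = 10.0     # assumed blended USD price


def monthly_cost(tasks_per_month: int, tokens_per_task: int,
                 price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Return monthly spend in USD for a given task volume."""
    return tasks_per_month * tokens_per_task * price_per_million / 1_000_000


def agentic_range(tasks_per_month: int) -> tuple[float, float]:
    """Return (low, high) monthly spend once the same tasks run through an agent."""
    low_tokens, high_tokens = (CHAT_TOKENS_PER_TASK * m for m in AGENT_MULTIPLIERS)
    return (monthly_cost(tasks_per_month, low_tokens),
            monthly_cost(tasks_per_month, high_tokens))


if __name__ == "__main__":
    tasks = 100_000  # assumed monthly task volume
    print(f"chat-era budget: ${monthly_cost(tasks, CHAT_TOKENS_PER_TASK):,.0f}")
    low, high = agentic_range(tasks)
    print(f"agentic range:   ${low:,.0f} to ${high:,.0f}")
```

The point of running it is the gap between the first line and the second: a budget built on the chat figure is out by the full multiplier before a single price changes.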
The second is that the input-to-output token ratio has inverted. Chat applications produced roughly two input tokens for every output token (the user typed a paragraph, the model wrote a paragraph back). Agentic workloads produce ratios of 50:1, 100:1, sometimes 1000:1, because the agent is reading huge volumes of context to make small focused decisions. The implication: if your finance team is anchored on the OUTPUT token cost (the visible number in most pricing pages, the one that looks expensive), you are looking at the wrong line. Input is now where the spend lives.
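To see why the ratio matters more than the headline output price, here is a minimal sketch that computes the share of spend attributable to input tokens at different input-to-output ratios. The two prices are assumptions (output priced at five times input, a common shape on public pricing pages); substitute your own contract rates.

```python
# Assumed USD prices per million tokens; replace with your contract rates.
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0


def input_share(input_tokens: float, output_tokens: float,
                input_price: float = INPUT_PRICE,
                output_price: float = OUTPUT_PRICE) -> float:
    """Return the fraction of total spend that comes from input tokens."""
    input_cost = input_tokens * input_price
    output_cost = output_tokens * output_price
    return input_cost / (input_cost + output_cost)


if __name__ == "__main__":
    # Ratios quoted in the text: chat at roughly 2:1, agents at 50:1 and above.
    for ratio in (2, 50, 100, 1000):
        share = input_share(ratio, 1)
        print(f"{ratio}:1 input-to-output -> {share:.1%} of spend is input")
```

Under these assumed prices, input is under a third of spend at the chat-style 2:1 ratio and over 90 percent at 50:1, even though the output token is five times more expensive per unit.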
These two shifts together explain why the AI line item keeps coming in higher than the pilot suggested. Per Deloitte's April 2026 AI Token Economics for CFOs brief, 80 to 85 percent of enterprises missed their AI forecasts by more than 25 percent this year. The pattern is not random. It is the predictable result of an old mental model running into new economics.
The four levers a CFO can pull
CFOs who are getting ahead of this have, in my work with finance teams across the UK and European mid-market over the last eighteen months, settled on four distinct control surfaces. Each does a specific thing. Each has its own buying decision. You will need a position on all four.
- Native admin consoles - the budget controls the AI provider gives you out of the box. Free with the contract. Capable but limited.
- The gateway layer - third-party platforms that sit in the data path between your developers and the model providers. Enforce hard budget caps in real time. The "kill switch" tier.
- Routers and open-weight models - the architecture decision about which model handles which class of query. Where the largest absolute savings come from.
- Company-wide AI policy and procurement - the procurement consolidation decision that ends the expense-claim sprawl and turns AI into a managed line item with audit trails.
The rest of this piece is one section per lever, in the order a CFO would typically meet them.
Lever one: what native admin consoles do (and where they stop)
Before buying any third-party tool, know what the AI providers already give you.
OpenAI Enterprise / API Admin organises API keys into Projects and lets a finance team set a monthly USD budget per project. When the budget is reached, calls get rejected. This is the most basic, most useful form of cost control, and most CFOs are not yet using it. The limitation: spend is tracked per project or per key, not per end user, so chargeback to a specific developer or feature is impossible without an external layer.
Anthropic Console for Teams provides workspace spend limits and alerts. The 2026 updates tightened SDK usage controls. The same limitation applies: one number per API key, no native answer to "which feature inside the product is driving the spend".
Google Vertex AI integrates directly with Google Cloud Billing and IAM. Budget alerts, hard caps via Google Cloud Quotas, and a useful lever that the OpenAI and Anthropic consoles do not offer: Provisioned Throughput, which lets a finance team commit to dedicated capacity at a predictable monthly cost and use HTTP headers to deny overage requests. This is the closest thing the AI providers offer to the predictable-bill experience a CFO is used to. The catch is that Vertex AI's pricing is itself complex, blending compute node hours with token costs, and so the predictability comes at the cost of comprehensibility.
The honest summary: native consoles will catch the obvious runaway projects. They will not give you per-user attribution, semantic caching, cross-provider observability, or feature-level chargeback. I tell every finance team I work with to treat the native consoles as the first thing that gets configured, not the last. They are necessary and not sufficient. Configure them on day one. Then buy a gateway.
Lever two: the gateway layer - where real-time enforcement lives
The category of tools sitting in the data path between your application code and the model providers has, in 2026, hardened into the most important governance investment a serious AI deployment makes. The pattern is the same across vendors: every request flows through the gateway, which logs it, caches what it can, and refuses to forward calls that breach a budget rule. This is the only place a budget rule can be enforced in real time. Native consoles alert; gateways act.
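The enforcement pattern is simple enough to sketch. The toy below is an in-memory illustration of "log it, then refuse to forward calls that breach a budget rule"; it is not how any named vendor implements it. The class name, the per-team budget dictionary, and the cost estimate passed in by the caller are all hypothetical, and caching is left out to keep it short.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when forwarding a request would breach the team's budget."""


@dataclass
class BudgetGateway:
    """Toy in-memory gateway: logs every request and blocks over-budget calls."""
    monthly_budget_usd: dict[str, float]
    spent_usd: dict[str, float] = field(default_factory=dict)
    log: list[tuple[str, float, str]] = field(default_factory=list)

    def forward(self, team: str, estimated_cost_usd: float) -> str:
        budget = self.monthly_budget_usd.get(team, 0.0)  # unknown teams get no budget
        spent = self.spent_usd.get(team, 0.0)
        if spent + estimated_cost_usd > budget:
            # Record the refusal so finance can see what was blocked and why.
            self.log.append((team, estimated_cost_usd, "blocked"))
            raise BudgetExceeded(f"{team} would exceed its ${budget:.2f} budget")
        self.spent_usd[team] = spent + estimated_cost_usd
        self.log.append((team, estimated_cost_usd, "forwarded"))
        return "forwarded"  # a real gateway would call the model provider here


if __name__ == "__main__":
    gateway = BudgetGateway(monthly_budget_usd={"platform": 10.0})
    print(gateway.forward("platform", 6.0))
    try:
        gateway.forward("platform", 5.0)
    except BudgetExceeded as err:
        print(f"blocked: {err}")
```

The design choice worth noticing is that the check happens before the call is forwarded. An alert that fires after the invoice is generated is a report; a check in the request path is a control.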
Portkey is the most credible enterprise-grade option as of mid-2026. It provides cost-based or token-based budget limits per workspace or API key, with daily and weekly resets, semantic caching that returns saved responses for repeated queries (a 90 percent reduction on token spend for repeated FAQ-style workloads is realistic), and the full enterprise stack: SSO, role-based access control, audit trails. Pricing starts at around $99/month for Pro tier and scales to $500-$5,000+ per month at enterprise log volume. The 2026 case studies that surface in public include Qoala, the Indonesian insurtech, where CTO Prateek Jogani uses Portkey to manage more than 25 distinct GenAI use cases with per-use-case cost tracking, and Haptik, the conversational AI provider, where CTO Swapan Rajdev deployed Portkey in production specifically because the native reporting in OpenAI and Azure did not meet enterprise needs.
Helicone and Langfuse are the open-source-derived alternatives. Helicone leans into unit economics, grouping API calls into sessions so a finance team can see that "an automated customer support resolution costs the company $0.12 on average". Its Pro tier sits around $60/month. Langfuse pairs cost data with quality evaluation, which matters because it lets engineering see whether the expensive Opus model is actually outperforming the cheap Haiku model on a specific task. Langfuse pricing runs $29/month for Core, $199/month for Pro, and $2,499/month for Enterprise, with unlimited seats across all paid plans. Both are observability platforms by default; budget enforcement requires routing traffic through their gateway component rather than relying on SDK-only tracing.
LiteLLM is the open-source routing proxy that powers a lot of self-built solutions. Excellent for engineering teams who want full control. Real maintenance overhead at enterprise scale. Reasonable for an engineering organisation that wants ownership; not the right answer for a finance function that wants a vendor on the line when something breaks.
Datadog LLM Observability extends the existing Datadog APM into generative AI. If your company already pays for Datadog, the marginal cost of adding LLM observability is low and the data lands in the dashboard your operations team already watches. The trade-off is that Datadog is configured for general infrastructure monitoring, not LLM-specific budget enforcement; turning alerts into actual traffic blocking requires real engineering work.
The right CFO move on this lever is to buy one gateway and put every developer behind it on day one. I would not run a serious AI engagement now without one in place. Making roughly 90 percent of spend visible in one place justifies the cost on its own.
Lever three: routers and open-weight models - where the real money is
This is the lever that produces the largest absolute savings, and it is the one most CFOs have not yet considered. The frame is simple: the proprietary frontier models (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro) cost between $5 and $15 per million tokens. Smaller open-weight models hosted on inference providers like Together AI or Fireworks cost between $0.05 and $0.10 per million tokens. That is roughly a 100x cost differential for the same unit of token throughput.
Almost every enterprise workload contains a long tail of routine, simple queries that the cheap models handle perfectly well, and a smaller head of complex queries that need the frontier model's reasoning. The architectural decision to route the easy work to the cheap models and reserve the expensive models for the difficult work is called task classification routing, and it is the highest-leverage cost decision in the entire stack.
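A minimal sketch of task classification routing, and of the blended cost it produces. The prices are taken from the ranges quoted above; the keyword rule is a deliberately crude stand-in (a production router would more likely use a small classifier model), and every name here is hypothetical.

```python
from dataclasses import dataclass

# USD per million tokens, picked from the ranges quoted above.
FRONTIER_PRICE = 10.0   # midpoint of the $5 to $15 range
CHEAP_PRICE = 0.10      # top of the $0.05 to $0.10 range

# Hypothetical keyword rule standing in for a real complexity classifier.
COMPLEX_HINTS = ("refactor", "architecture", "legal", "multi-step", "debug")


@dataclass
class Query:
    text: str
    tokens: int


def route(query: Query) -> str:
    """Send queries that look complex to the frontier tier, the rest to the cheap tier."""
    text = query.text.lower()
    return "frontier" if any(hint in text for hint in COMPLEX_HINTS) else "cheap"


def blended_cost(queries: list[Query]) -> float:
    """Total USD cost when each query is priced at the tier it was routed to."""
    total = 0.0
    for q in queries:
        price = FRONTIER_PRICE if route(q) == "frontier" else CHEAP_PRICE
        total += q.tokens * price / 1_000_000
    return total


def all_frontier_cost(queries: list[Query]) -> float:
    """Total USD cost if every query went to the frontier model out of habit."""
    return sum(q.tokens for q in queries) * FRONTIER_PRICE / 1_000_000


if __name__ == "__main__":
    workload = [
        Query("What are your opening hours?", 1_000_000),
        Query("Debug this failing payment flow", 1_000_000),
    ]
    print(f"all frontier: ${all_frontier_cost(workload):.2f}")
    print(f"routed:       ${blended_cost(workload):.2f}")
```

The saving scales with the share of traffic the classifier can safely send to the cheap tier, which is why the quality bar for each workload has to be measured before anything is migrated.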
OpenRouter is the most prominent player in the routing space. As of mid-2026 it processes more than 100 trillion tokens per month and raised a $113M Series B in late 2025. It charges almost no markup on token pricing and monetises through a roughly 5 percent fee on account credit purchases. For small and mid-sized companies and for R&D exploration, the OpenRouter convenience savings are real: one API key replaces five, and you can switch models without rewriting application code. For a true enterprise pushing billions of tokens a month, OpenRouter's enterprise value is in workflow routing rather than unit pricing. A negotiated direct contract with Anthropic or OpenAI, or cloud-provisioned throughput on AWS Bedrock or Vertex AI, will produce a better per-token rate at that scale.
The other credible aggregators worth knowing by name:
- AWS Bedrock and Azure AI Model Catalog are the enterprise-grade choices because the bill rolls into existing cloud commitments (Enterprise Discount Programs, Microsoft Azure Consumption Commitments) and you inherit the cloud provider's SOC 2 and data residency guarantees.
- Together AI and Fireworks AI are inference providers that run open-weight models at scale on dedicated infrastructure. This is where the 100x cost differential lands for production workloads.
- Vercel AI Gateway is the right choice if your front-end stack already lives on Vercel; an awkward fit otherwise.
The interesting question is not which router to buy. It is which workloads to migrate to cheaper models. Three named operators have published numbers worth citing in board papers:
- LinkedIn built a family of EON models on top of Llama 3 for internal and user-facing workflows. Reported costs were 75 times lower than GPT-4 and 6 times lower than GPT-4o, with higher domain-specific accuracy through fine-tuning.
- Convirza, an enterprise speech analytics provider, used Llama-3-8B with specialised adapters for distinct performance indicators. Reported a 10x cost reduction versus OpenAI and an 8 percentage-point accuracy improvement on the same task.
- Uber integrated Llama and Mixtral inside its Ray infrastructure, drastically reducing the per-output cost of generative work across the platform.
The CFO frame on this is unromantic. The right model for any given enterprise workload is the cheapest model that hits the quality bar. The cheapest model is almost never the frontier model. When I look at where the biggest savings are sitting unclaimed in the finance teams I work with, it is almost always here: high-volume routine work still being sent to Opus or GPT-5 out of habit, when a fine-tuned Llama or a Gemini Flash would do the same job at one percent of the cost. A serious AI engineering team in 2026 owns this decision as a quality-and-cost optimisation, not as a vendor-loyalty bet.
Lever four: the company-wide policy that ends the expense-claim era
The pattern most mid-market CFOs are still living with is the worst possible operating model for AI spend. Individual developers expense ChatGPT, Claude, Cursor, GitHub Copilot, Replit, v0, and Lovable independently. The finance team sees a fragmented forest of $20-per-month line items with no central visibility, no SSO, no audit trail, and no way to tell who is using what for what.
The 2026 fix has a name: AI Prime consolidation. The pattern: sign one enterprise contract with a primary AI provider (Anthropic for Teams, OpenAI Enterprise, or Microsoft 365 Copilot depending on the workforce mix), get SSO, get SOC 2 compliance, get zero-data-retention guarantees, get audit logs, get per-team cost allocation. Stop the expense-claim sprawl. Set a clear approved-tools list for the developer-specific layer (Cursor for IDE work, GitHub Copilot for code completion, Claude Code for agentic coding tasks), each on a per-team budget.
Mid-2026 pricing benchmarks, from public sources and aggregate spend-tracking data:
| Tool | Tier | Typical effective price |
|---|---|---|
| OpenAI Enterprise | Custom contract, 150-seat minimum | $60-$90 per user/month effective; large enterprise contracts average ~$561k/year |
| Microsoft 365 Copilot | SMB Business | $21 per user/month |
| Microsoft 365 Copilot | Enterprise | $30 per user/month (plus pay-as-you-go "Copilot Credits" for agent workflows) |
| Anthropic for Teams | Team plan | ~$30 per user/month (Claude Code integrated) |
| Cursor for Teams | Per-seat | ~$40 per user/month |
| GitHub Copilot Business | Per-seat | $19 per user/month |
These are list prices and benchmark averages. True enterprise prices on the OpenAI and Anthropic tiers remain under NDA, and a specific CFO's reality depends entirely on negotiated volume commitments. Treat the numbers above as a planning anchor, not a quote.
The hardest governance problem in this lever is shadow AI: the engineer who pulls out a personal credit card to subscribe to a tool the company has not approved because it makes them faster. I have watched this pattern play out across half a dozen engagements, and the right answer is never enforcement theatre. The right answer is a fast, lightweight approval path that lets a developer make the case for a new tool, get a per-team budget allocation if the case is good, and not feel like the finance function is treating them as the enemy. The CFO playbook here is to be the team that says yes quickly to the right tools, not the team that says no slowly to everything.
Beyond the seat-based licences, the wider hidden costs that catch CFOs out:
- Inference cost at production scale typically dominates the line item at 80-90 percent of system lifetime cost.
- Vector database and RAG infrastructure add real recurring cost. Pinecone, Weaviate, pgvector hosting all need budgeting.
- AI Admin talent is the new specialist role most finance teams have not budgeted for. Expect at least one full-time hire for any company past pilot stage.
- Security audits and compliance reviews for AI deployments are now their own line item, particularly under the EU AI Act's August 2026 enforcement deadline.
Deloitte's CFO surveys land an uncomfortable number on this: fewer than 1 percent of executives report a 20 percent or higher ROI on their generative AI investment. The bulk are paying for AI without yet capturing the value. The CFO who closes that gap captures the budget for the next eighteen months of AI deployment. The CFO who does not gets a finance review they would rather not be sitting in.
The 90-day CFO playbook
If you have the four levers in place at the right level of seriousness, you remove the surprise from the AI line item. The shape of the playbook a tech-leading CFO is running in 2026:
Days 1 to 30. Audit current spend. Sum every AI line across credit cards, corporate cards, expense claims, and direct API invoices. Build a single view of total AI spend by team. Most finance functions doing this for the first time discover their real spend is 2 to 3 times higher than the line item they were tracking. Identify the top three spenders and have a conversation about what they are using and why.
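The audit itself is mostly aggregation. A minimal sketch, assuming the spend has already been exported into one list of records; the field names, team names, and amounts below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical export: one record per AI charge, whatever its source.
SAMPLE_RECORDS = [
    {"team": "platform",  "source": "expense_claim",  "vendor": "ChatGPT", "usd": 20.0},
    {"team": "platform",  "source": "api_invoice",    "vendor": "Anthropic", "usd": 4200.0},
    {"team": "support",   "source": "corporate_card", "vendor": "OpenAI", "usd": 1500.0},
    {"team": "support",   "source": "expense_claim",  "vendor": "Cursor", "usd": 40.0},
    {"team": "data",      "source": "api_invoice",    "vendor": "OpenAI", "usd": 900.0},
    {"team": "marketing", "source": "credit_card",    "vendor": "ChatGPT", "usd": 60.0},
]


def spend_by_team(records: list[dict]) -> dict[str, float]:
    """Sum every AI charge into a single total per team."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["team"]] += record["usd"]
    return dict(totals)


def top_spenders(records: list[dict], n: int = 3) -> list[tuple[str, float]]:
    """Return the n highest-spending teams, largest first."""
    ranked = sorted(spend_by_team(records).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    for team, usd in top_spenders(SAMPLE_RECORDS):
        print(f"{team:<10} ${usd:,.2f}")
```

The hard part in practice is getting the four sources into one export; once they are, the single view by team is a few lines.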
Days 31 to 60. Sign one prime-vendor enterprise contract. Deploy a gateway (Portkey, Helicone, or LiteLLM behind an engineering manager who owns it) in front of every developer. Set per-team monthly budgets at the gateway level. Configure native console budget limits on every direct API key. Publish an approved-tools list for the developer layer.
Days 61 to 90. Run one workload through task classification. Pick a high-volume, low-complexity workload (customer support tier-one triage, internal document summarisation, FAQ matching) and migrate it to a cheap open-weight model on Together AI or Fireworks. Measure the cost-per-resolution before and after. I have seen this single migration return between 60 and 90 percent on a workload's monthly cost without any visible change to the user experience. Use it as the proof of concept for the next migration. Set up your monthly AI finance review with the line items mapped to business outcomes, not to token counts.
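The before-and-after measurement can be sketched in a few lines. The monthly costs and resolution counts below are invented to show the calculation; they are not results from any engagement.

```python
from dataclasses import dataclass


@dataclass
class WorkloadMonth:
    """One month of a workload on a given model (all figures hypothetical)."""
    model_cost_usd: float
    resolutions: int

    @property
    def cost_per_resolution(self) -> float:
        if self.resolutions <= 0:
            raise ValueError("resolutions must be positive")
        return self.model_cost_usd / self.resolutions


def saving_percent(before: WorkloadMonth, after: WorkloadMonth) -> float:
    """Percentage drop in cost per resolution after the migration."""
    return (1 - after.cost_per_resolution / before.cost_per_resolution) * 100


if __name__ == "__main__":
    before = WorkloadMonth(model_cost_usd=6000.0, resolutions=50_000)  # frontier model
    after = WorkloadMonth(model_cost_usd=1500.0, resolutions=50_000)   # open-weight model
    print(f"before: ${before.cost_per_resolution:.3f} per resolution")
    print(f"after:  ${after.cost_per_resolution:.3f} per resolution")
    print(f"saving: {saving_percent(before, after):.0f}%")
```

Dividing by resolutions rather than reporting raw monthly cost keeps the comparison honest if volume changes between the two months. A fuller version would also track a quality measure alongside cost, so that a cheaper model that resolves fewer tickets does not look like a win.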
Three months of disciplined work changes the conversation from "why did the AI line item go up again" to "we know exactly where the AI line item is, and here is the next workload we are moving to cheaper infrastructure".
Where this is heading: agentic FinOps
Gartner has given this discipline a name: Agentic FinOps. The 2026 framing is that agentic AI workloads have decoupled cost from human usage. A 10-person engineering team can drive token consumption equivalent to a 1,000-person team running pre-agentic chat. The traditional FinOps tooling, built for cloud-infrastructure cost optimisation, does not yet handle the recursive consumption pattern of agents talking to other agents. The 2026 tooling stack (CloudZero AI, the Portkey-plus-Langfuse pairing, the dedicated agentic-FinOps offerings) is what fills that gap.
The CFO who treats AI cost as a tractable problem with a known playbook, rather than as a mysterious variable to be feared, captures the credibility to keep funding the AI initiatives that are working. The CFO who waits for the next surprise gets pulled into a board conversation about why the AI line item exists at all. Of the two postures, in my experience the first is the only one that survives 2027.
This piece sits alongside AI for the managing director and the CEO, which makes the executive-level case for sponsoring serious AI work. It pairs with what RAG is and what it could do in your company on the architecture side, and with the boring discipline that stops hallucination being a problem on the quality side. Each one is a different lever on the same underlying transition: from AI as exciting demo to AI as managed infrastructure inside a real P&L.
For a structured pass through your company's current AI spend, vendor sprawl, and governance gaps, the Houtini AI Audit is the engagement that surfaces what the playbook would look like for your specific situation. The Audit produces a brief the senior team can sign off on, and the implementation engagement that follows installs the discipline alongside whatever tools you choose.
Educational only, not financial advice. Pricing benchmarks compiled from public sources and aggregate spend-tracking data as of mid-2026; specific enterprise contracts may differ. The Uber spend story is reported in public sources; we have not independently verified the exact figures.