The Cloud Playbook, Replayed: Why Owning Your Inference Stack Will Become the Default

Cassie Breviu

If you've been in tech long enough, the current conversation around running your own AI models (on-device, on-prem, or self-hosted in the cloud) instead of paying per token for someone else's probably sounds familiar. Swap "model" for "server" and "inference" for "workload," and you're back in 2012 debating whether to move off on-prem.

The parallels aren't surface-level. They're structural. But here's where the analogy breaks in an important way: the cloud transition moved workloads away from hardware you control. The AI model transition is moving them back. And this time economic and technical forces are pulling in the same direction.

The people selling LLM inference never stopped to ask the question every engineering team eventually learns to ask: does this actually scale? Per-token pricing doesn't, and you don't need a high-powered LLM for every task. They assumed energy and money would keep appearing and that ROI would arrive before the money ran out. We are at a tipping point, but we as builders have the control, and it's time we used it.

Opaque capacity and per-token charging are a margin structure disguised as a platform, and it works exactly as long as customers don't notice they can run capable model tasks on their own, on a laptop, on their own servers, or on cloud compute they control, for a fraction of the cost. The teams who've been doing ML for years know this. Everyone else is noticing now and needs to rethink their ML pipelines and applications.

The Original Migration: On-Prem to Cloud

A decade ago the industry went through a massive shift. Organizations ran everything on their own hardware: their own data centers, their own racks, their own ops teams patching servers. Then cloud providers showed up with a compelling pitch: don't manage infrastructure, just consume it as a service.

The benefits were real:

  • Scale on demand. No more capacity planning months ahead or buying hardware for peak load that sits idle 90% of the time.
  • Reduced ops burden. Let someone else worry about physical security, power, cooling, and hardware failures.
  • Speed to market. Spin up environments in minutes, not quarters.

But the trade-offs were also real:

  • Cost at scale. Cloud is cheap to start and expensive to stay in. Many companies discovered that for steady-state workloads, they were paying multiples of what on-prem cost.
  • Data sovereignty and compliance. Sending data off-premises raised regulatory questions that hadn't existed before.
  • Latency and control. Some workloads just needed to be close to the metal.
  • Vendor lock-in. The deeper you went into a provider's ecosystem, the harder it was to leave.

The cloud transition wasn't a one-way door. Plenty of organizations ended up in hybrid architectures, running sensitive or high-volume workloads on-prem while pushing elastic and experimental workloads to the cloud. The answer was never all-or-nothing; it was about matching the deployment model to the workload.

What Pushed Us to Cloud Is Now Pulling Us Back to Ownership

Now look at the AI model landscape and notice something counterintuitive: the exact benefits that drove cloud adoption are now the arguments for owning your inference stack, whether that means on-device, on-prem, or running your own models on cloud compute.

For the last few years, the default has been to call credit-based model APIs. Send your prompt to an endpoint, get a response back, pay per token. The benefits mirror early cloud adoption almost exactly:

  • No infrastructure to manage. You don't need GPUs, you don't need to worry about model serving, you don't need to optimize memory.
  • Access to the best models. The largest, most capable models are API-only. You get frontier performance without frontier hardware.
  • Fast iteration. Swap model versions with a config change. No redeployment needed.

But the trade-offs that emerged with cloud infrastructure? They're the same ones now pushing teams toward running their own models:

  • Cost at scale. Per-token API pricing adds up fast when you're making thousands of inference calls. Running your own model starts looking attractive at volume.
  • Data privacy. Not every prompt or piece of context can leave your network. On-device and on-prem keep data entirely in-house; self-hosted cloud models keep data in your own tenant rather than flowing through a third-party API.
  • Latency. An on-device model responds in milliseconds without a network round-trip. Even self-hosted cloud models avoid the overhead of shared multi-tenant API infrastructure.
  • Availability and control. No rate limits, no outages you can't control, no API deprecations that break your workflow overnight. You own the deployment, you own the uptime.
  • Vendor independence. Running models on infrastructure you control means your stack doesn't depend on a single provider's pricing decisions or terms of service.
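The "cost at scale" point above is easy to make concrete with back-of-the-envelope math. The numbers below are hypothetical (a blended per-million-token price and a monthly GPU instance cost, not quotes from any real provider), but the structure of the comparison is what matters: metered spend grows linearly with volume, while owned capacity is roughly flat.

```python
def monthly_api_cost(tokens_per_month: int, price_per_million: float) -> float:
    """Per-token API spend for a month, given a blended $/1M-token price."""
    return tokens_per_month / 1_000_000 * price_per_million

def breakeven_tokens(gpu_cost_per_month: float, price_per_million: float) -> int:
    """Token volume at which a flat monthly GPU cost matches metered pricing."""
    return int(gpu_cost_per_month / price_per_million * 1_000_000)

# Hypothetical numbers: $5.00 blended per 1M tokens vs. a dedicated
# GPU instance at $1,500/month.
api = monthly_api_cost(2_000_000_000, 5.00)  # 2B tokens/month
print(f"API spend:     ${api:,.0f}/month")               # $10,000/month
print(f"Break-even at: {breakeven_tokens(1500, 5.00):,} tokens/month")
```

Past the break-even volume, every additional token widens the gap in favor of owned inference; below it, metered pricing is the cheaper option, which is exactly the hybrid logic the rest of this post argues for.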

The ML teams that had been doing this work before the LLM wave already knew this. Many of them never changed their patterns — they kept training and serving custom models on their own infrastructure and only leveraged credit-based APIs where it made sense. They didn't treat managed AI APIs as a replacement for ownership; they treated them as one option in a toolkit.

The Spectrum of Ownership

The conversation often gets framed as a binary (cloud APIs vs. local), but the real landscape is a spectrum with four distinct deployment models:

  • Credit-based API calls. You pay per token for someone else's model on someone else's infrastructure. Pricing is opaque, capacity is shared, and every request is metered. This is the easiest on-ramp and the most expensive at scale.
  • Self-hosted models in the cloud. You deploy a model on your own VMs or containers in the cloud. You control the model, the data stays in your tenant, and you're paying for compute. You get cloud elasticity without the per-token tax.
  • On-prem. You run models on your own servers in your own data center or colo. Full control over data, hardware, and availability. Higher upfront investment, but predictable costs and no external dependencies.
  • On-device. Models run directly on laptops, workstations, phones, or edge hardware. Zero network latency, complete data privacy, and no ongoing infrastructure cost beyond the device itself.

Each step down this list trades convenience for control and cost efficiency. With credit-based LLM calls, you're paying for someone else's model on someone else's terms. Self-hosted cloud gives you ownership of the model while keeping the operational flexibility of cloud compute. On-prem and on-device give you full ownership of the entire stack.

For many teams, self-hosted cloud is the practical first step away from credit-based APIs. You don't have to go all the way to on-device to escape the worst trade-offs of a fully managed service. But the direction of travel is clear: the more you own, the better the economics get at scale.
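One reason that first step is practical: many self-hosted serving stacks expose OpenAI-compatible endpoints, so moving a workload from a hosted API to your own tenant can be little more than a base-URL change. The sketch below (stdlib only; the endpoint addresses and model names are hypothetical) shows that the request shape stays identical across both deployment models.

```python
import json

def chat_request(base_url: str, model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions request. Between a hosted API
    and a self-hosted endpoint, only the base URL (and model name) change;
    the payload shape stays the same."""
    return {
        "url": f"{base_url}/v1/chat/completions",
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

# Hosted API (hypothetical endpoint and model names):
hosted = chat_request("https://api.example.com", "frontier-large", "Summarize this.")
# Self-hosted in your own tenant: same request shape, your URL.
owned = chat_request("http://10.0.0.12:8000", "llama-3.1-8b-instruct", "Summarize this.")
print(hosted["url"])
print(owned["url"])
```

Because the interface is stable, the deployment decision becomes a routing decision rather than a rewrite, which is what makes the hybrid end state described later in this post workable.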

Same Debates, Different Decade, But the Gravity Has Shifted

Even the arguments people have are the same. But unlike the cloud transition, where the economics genuinely favored centralization for most workloads, the economics of AI inference are trending hard toward ownership. Here's why:

  • Hardware is catching up faster than anyone expected. The gap between what you can run on consumer hardware and what requires a data center is collapsing. Models that needed 80GB A100s two years ago run on your device today. Hardware vendors are racing to make local inference a first-class capability. This isn't a niche trend. It's a silicon roadmap priority across the entire industry.

  • Model efficiency is improving. Quantization, distillation, and architecture improvements mean that smaller models are getting dramatically better. The "you need the biggest model for every request" argument weakens with every release cycle and was faulty to begin with.

  • The cost math only goes one direction. Credit-based API pricing is per-token, forever. Whether you run on-device, on-prem, or self-hosted in the cloud, the cost structure improves the moment you stop paying per token. For any workload with predictable, sustained volume, owning your inference wins on cost.

  • Developers want ownership. The developer experience has reached a tipping point across all three self-managed options. Running models on-device is as simple as calling an API. Deploying to your own cloud infrastructure or on-prem servers is well-tooled and well-documented. Developers overwhelmingly prefer tools they control with dependable outcomes.
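The memory arithmetic behind the collapsing hardware gap is simple: weight memory scales with parameter count times bits per weight, so quantization directly shrinks the hardware a model needs. A rough sketch (weights only; it ignores KV cache and activation overhead, so treat the results as lower bounds):

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory for a dense model: params * bits / 8 bytes.
    Ignores KV cache and activation overhead."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 7B-parameter model at different precisions:
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{model_memory_gb(7, bits):.1f} GB")
```

At 16-bit precision a 7B model needs roughly 14 GB just for weights; at 4-bit it fits in about 3.5 GB, which is why models that once demanded data-center GPUs now run on laptops and phones.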

Credit-based APIs will still matter for frontier research, for tasks that actually require the largest models, for burst capacity, and for teams that don't have the skillset for applied AI. But for an increasing number of workloads, owning the inference stack wins on cost, privacy, latency, and control.

The Playbook You Already Have

If you went through the on-prem to cloud era, you don't need a new framework for thinking about this. You already have one, but you should apply it with the understanding that the center of gravity is different this time:

  • Own your inference unless you have a reason not to. Whether that means on-device, on-prem, or self-hosted in the cloud depends on your constraints. The old advice was "start in the cloud, pull back to on-prem if you need to." The AI version inverts it: start with ownership, and reach for credit-based APIs when you hit a capability ceiling. For most tasks, you won't hit it.
  • The capability gap is closing faster than you think. Credit-based APIs once had a durable advantage in model quality. That gap is closing fast across all deployment models. What you dismiss as "not good enough" today will likely be production-ready next quarter.
  • Hybrid is the end state. During the cloud migration, hybrid architectures became permanent for many organizations. With AI inference, the hybrid mix will span all four deployment models — on-device, on-prem, self-hosted cloud, and credit-based APIs — each matched to the workload that fits.
  • Bet on the hardware trajectory. Every established chip maker, and a wave of new entrants, is optimizing for AI inference. That level of industry alignment doesn't happen without a destination. The hardware will get better, faster, and cheaper, making on-device and on-prem more capable every cycle.

History doesn't repeat exactly, but the structural pattern here is hard to miss. The difference is the direction. The cloud transition centralized computing. The AI model transition is distributing it. And this time, the economics and the hardware roadmap are aligned. If you're building AI into your product or workflow today, owning your inference stack (on-device, on-prem, or self-hosted in the cloud) shouldn't be a future consideration. It should be in your architecture now.


The views and opinions expressed in this post are my own and do not reflect those of my employer.