shaping intuition

Future shape of the technology industry

May 7, 2026

These are my current notes — what I'm learning and trying to understand about what's driving this shift.

The thesis

Frontier agentic experiences — Opus 4.6 / 5T class running in real agent loops — cost $100–$1,000 a day today, and are heading to $10,000 a day for the most demanding workloads. Anthropic recently had to push their heaviest subscribers off unlimited plans because the unit economics didn't hold. On the raw API, OpenClaw at any meaningful intensity is unusable for a normal human; you'd burn a month's rent in a weekend.

The defining question for the next phase of the industry is how to drive that down to $20 a month. Whoever solves it owns the consumer surface.

Four levers do the work:

  1. Model compression — distillation into smaller models (Gemma 4 31B class). Mostly research. No information-density limit visible yet, but the ceiling is the teacher: you can only distill what the frontier knows, so frontier progress is the precondition. Expect 6–12 months for current frontier capability to land compressed.
  2. Local inference — when a capable model fits on consumer hardware, marginal token cost collapses to electricity. The hardware curve (unified memory bandwidth, NPU acceleration, cheaper silicon) pushes runnable model size up independently of compression. Two curves compounding.
  3. Efficient harness — engineering, not research. Caching, smart routing (cheap model for easy subtasks, frontier only for hard ones), better tool orchestration. Claude Code / OpenClaw shape.
  4. Memory and continual learning — the dominant cost in long-horizon agentic work is context. Persistent memory, episodic recall, and continual learning let an agent stop re-deriving what it already knows.

Model compression

Same capability, smaller compute budget. The most direct lever on the cost curve, and the most active research area.

The active toolkit is distillation, quantization, sparse architectures, and hybrid attention. Distillation takes a frontier teacher and trains a smaller student to mimic it on synthetic curricula; the Gemma family is the cleanest public example, with Google's open-weight models hitting performance points that would have been frontier-class two years ago. Quantization shrinks each parameter from 16 bits to 4 with minimal quality loss, or to 2 bits and below (1.58 in BitNet's case, where the model is trained at low precision from the start) at a larger quality cost. That is a 4–10x compression on memory footprint alone. Sparse architectures (MoE, mixture-of-depths, early-exit) make a model "big" in capacity but cheap in compute by activating only a fraction of weights per token.

Hybrid attention is the newer thread. Pure transformers pay quadratic cost in sequence length. State-space models like Mamba replace attention with a linear-time recurrence, and hybrids like Jamba interleave transformer attention layers with those linear-time layers. This matters more every year as agentic context windows blow past 100K tokens and the quadratic term starts to dominate.

Each generation of compression has historically delivered ~10x on either memory or compute. Compounded across the next 18-24 months, this is the first major contributor to the roughly 1000x cost reduction that the $20-a-month target implies.
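As a rough check on the size of that gap, the daily figures from the thesis can be converted into a required reduction factor. This is a sketch under one assumption that is mine, not the post's: a 30-day month of continuous use.

```python
# Rough size of the gap between frontier agent usage and a $20/month plan.
# Assumption (not from the post): 30 days of use per month.
DAYS_PER_MONTH = 30
TARGET_PER_MONTH = 20.0  # dollars


def required_reduction(cost_per_day: float) -> float:
    """Factor by which monthly cost must fall to hit the target price."""
    return cost_per_day * DAYS_PER_MONTH / TARGET_PER_MONTH


for daily in (100, 1_000, 10_000):
    factor = required_reduction(daily)
    print(f"${daily:>6,}/day -> {factor:,.0f}x reduction needed")
```

The $1,000/day case comes out at about 1,500x, which is presumably the order of magnitude the "1000x" shorthand stands for.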

The precondition is frontier supply. Distillation is downstream of frontier progress — you can only compress what the frontier knows. If GPT-6, Claude 5, Gemini 4 stop adding meaningful capability, the cheap tier hits a ceiling 12-18 months later. So far the frontier keeps moving, and distillation has something to chase.

Local inference

When the model runs on the user's device, marginal token cost collapses to electricity. Once you've bought the chip, output is essentially free. This is the single largest economic lever once it's reachable, and what makes it reachable is two curves compounding: compression making capable models smaller, and hardware progress making consumer silicon more capable.

Apple Silicon is the wildcard on the hardware side. The M-series uses unified memory (CPU and GPU share RAM directly, with no PCIe transfer), and the high-end SKUs ship with 192 GB or more at ~800 GB/s. That is enough to run 70B models comfortably, and 100B+ with quantization, on a desktop machine that also runs your IDE and Slack. On the current trajectory, 100B+ parameter models run on a laptop within 3-4 years.

NPUs are becoming standard everywhere else: Apple Neural Engine, Qualcomm Snapdragon X Elite, Intel Lunar Lake. They trade peak FLOPS for 10–100x better perf-per-watt on inference workloads. Not great for training, but training is a once-per-model event; inference is forever. Sustained-inference thermal envelopes — not just burst capacity — become the relevant spec.

The deeper unlock is memory bandwidth. LLM inference is memory-bound; the GPU spends most of its time waiting to read weights from RAM. HBM, which has been a datacenter-only luxury, is starting to appear in consumer parts. When that lands at scale, the laptop becomes a viable platform for what previously required a $30K H100 box.
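A back-of-the-envelope way to see why bandwidth is the unlock: if every weight has to be read once per generated token, bandwidth divided by model size gives a ceiling on decode speed. The sketch below assumes a dense model, batch size 1, and ignores KV-cache reads and compute time, so treat it as an upper bound and not a prediction. The function name is made up for illustration.

```python
def decode_tokens_per_sec_ceiling(params_billion: float,
                                  bits_per_param: float,
                                  bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumes each weight is read from memory exactly once per token.
    """
    weight_gb = params_billion * bits_per_param / 8  # 1e9 params * bytes / 1e9
    return bandwidth_gb_s / weight_gb


# 70B model at 4-bit (35 GB of weights) on ~800 GB/s unified memory
print(round(decode_tokens_per_sec_ceiling(70, 4, 800), 1))
# Same model on a ~3 TB/s datacenter GPU
print(round(decode_tokens_per_sec_ceiling(70, 4, 3000), 1))
```

Under these assumptions, the only ways to raise the ceiling are more bandwidth or fewer bytes per token, which is why compression and the hardware curve compound.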

The software side is mature enough to take for granted. llama.cpp runs on basically anything; Ollama wraps it for one-line model installation; MLX takes advantage of Apple unified memory; ExLlamaV2 squeezes the most out of consumer NVIDIA. The local serving stack is no longer the bottleneck — hardware capacity is.

Once 70B-class models run at 30 tokens/sec on $2,000 hardware, cost-per-magical-experience drops by another order of magnitude — not through software, but because the user already paid for the chip.

Efficient harness

The loop that wraps the model with tools, memory, and control flow. Engineering, not research — and most of it has barely been built. Invisible from the user's perspective, but where the largest cost variance hides.

A great harness reuses context, caches tool schemas, splits planning from execution, handles errors locally, and avoids re-deriving information already established. A bad harness re-sends entire conversation history every step, retries with full context after every failure, and asks the model to re-derive its plan on each turn. Same model. Same task. The bad harness can spend 10x more tokens to reach the same answer — sometimes more.
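To make the gap concrete, here is a toy token-accounting model of an agent loop. It is only a sketch: the step sizes are invented, the cache-read multiplier of 0.1 is borrowed from the pricing notes in the appendix, and cache-write premiums are ignored.

```python
CACHE_READ_MULTIPLIER = 0.1  # assumption: cached input billed at 0.1x


def naive_input_tokens(system_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Full history re-sent and billed at full price on every step."""
    return sum(system_tokens + i * tokens_per_step for i in range(steps))


def cached_input_tokens(system_tokens: int, tokens_per_step: int, steps: int) -> float:
    """Same history, but the already-seen prefix is billed as a cache read.

    Returns full-price-equivalent input tokens.
    """
    total = 0.0
    prefix = 0  # tokens already sitting in the cache
    for i in range(steps):
        context = system_tokens + i * tokens_per_step
        new_tokens = context - prefix
        total += new_tokens + CACHE_READ_MULTIPLIER * prefix
        prefix = context
    return total


# Hypothetical loop: 5K tokens of system prompt and tool schemas,
# 2K new tokens per step, 50 steps.
naive = naive_input_tokens(5_000, 2_000, 50)
cached = cached_input_tokens(5_000, 2_000, 50)
print(naive, round(cached), round(naive / cached, 1))
```

With these made-up numbers the naive loop pays for about 7x more input, and the ratio keeps growing with the number of steps because the naive cost is quadratic in step count.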

Smart routing is the harness's other major lever. Most queries are easy. If 95% of what you ask of a personal AI can be handled by a small local model — drafting, summarizing, simple coding, reformatting, lookup — and only the remaining 5% need a frontier model, the cloud bill drops by 20x without losing magical capability when it matters. Apple Intelligence and Gemini Nano are the early production templates: small on-device model handles most requests, hard ones escalate to cloud. Today's routing is coarse — a "simple or hard" classifier — but the next wave is fine-grained per-token routing, dynamic capability detection, and personal models that learn each user's task distribution.
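The 20x figure falls straight out of the routing fraction when local requests are treated as free. A minimal sketch, with a hypothetical difficulty score standing in for whatever classifier a real router would use:

```python
def cloud_bill_reduction(frontier_fraction: float,
                         local_cost_ratio: float = 0.0) -> float:
    """How many times cheaper the blended bill is than sending
    everything to the frontier model.

    local_cost_ratio is the cost of a local request relative to a
    frontier request (0.0 means local is treated as free).
    """
    blended = frontier_fraction + (1 - frontier_fraction) * local_cost_ratio
    return 1 / blended


def route(difficulty: float, threshold: float = 0.8) -> str:
    """Coarse "simple or hard" router.

    `difficulty` is a hypothetical 0-1 score; the threshold is arbitrary.
    """
    return "frontier" if difficulty >= threshold else "local"


print(round(cloud_bill_reduction(0.05), 1))        # local treated as free
print(round(cloud_bill_reduction(0.05, 0.01), 1))  # local costs 1% of frontier
print(route(0.3), route(0.95))
```

Note how sensitive the result is to the local cost: even a 1% relative cost for local requests pulls the 20x down noticeably, which is part of why on-device inference matters for this lever.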

The next layer is OS-level primitives — system-provided tool use, file access, and memory APIs that every application doesn't have to rebuild — and standardized cheap pipelines for the routine 80% of operations (summarize, draft, search, schedule). When those exist, frontier inference is reserved for the genuinely novel 20%.

This is where the leverage is right now. A 7B model with a great harness and smart routing beats a frontier model running stateless on cost-per-completed-task, often by an order of magnitude. The $20/month economics don't pencil without it.

Memory and continual learning

The dominant cost in long-horizon agentic work is context. Every session re-establishes state, re-reads files, re-derives plans the agent had already worked out yesterday. The agent pays full token cost for what a human would simply remember.

Persistent memory is the most concrete part of the answer. Tools like Claude Code's CLAUDE.md, Anthropic's memory tool, and MCP memory servers are early but real templates: agents that record what they've learned and read it back at the start of the next session. With persistent memory, agents start cheap and stay cheap. Without it, every session pays the full token tax of context establishment.
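The record-then-read-back pattern is simple enough to sketch in a few lines. This is not how any of the tools above are implemented; the file name, function names, and JSON layout are all invented for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_notes(path: Path) -> list[str]:
    """Read notes saved by earlier sessions (empty list on first run)."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def remember(path: Path, note: str) -> None:
    """Append a note, skipping exact duplicates."""
    notes = load_notes(path)
    if note not in notes:
        notes.append(note)
        path.write_text(json.dumps(notes, indent=2), encoding="utf-8")


def session_preamble(path: Path) -> str:
    """Text to prepend to the next session's context."""
    notes = load_notes(path)
    if not notes:
        return ""
    return "Known from earlier sessions:\n" + "\n".join(f"- {n}" for n in notes)


# Hypothetical usage with a throwaway file
memory_file = Path(tempfile.mkdtemp()) / "agent_memory.json"
remember(memory_file, "Tests run with `make test`")
remember(memory_file, "User prefers small commits")
print(session_preamble(memory_file))
```

The point of the sketch is the cost shape: a few hundred tokens of notes replace thousands of tokens spent rediscovering the same facts each session.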

The deeper structure splits memory into episodic — "what happened in session 47" — and semantic — "what does the user generally prefer." These have different storage and recall mechanisms. Vector databases handle semantic similarity well; episodic memory is closer to a structured log with timestamps and causal links. Building both right is genuinely unsolved, especially at the consumer scale of millions of personal AIs each accumulating their own history.

Continual learning is the harder, longer-horizon problem. Frontier models are mostly frozen — once trained, they don't update. An agent that can incorporate yesterday's learnings into tomorrow's behavior, without a full retrain, unlocks a different cost curve entirely. Today's "memory" tools paper over this by stuffing learned facts into context. True continual learning would mean updating model weights cheaply, on the user's own hardware, from their own data. A handful of research threads (LoRA-on-the-fly, online fine-tuning, expert insertion in MoE models) are circling this; nothing is shippable yet.

This is the least mature of the four levers and probably the highest-leverage. Without it, every personal-AI session pays full setup cost; with it, cost-per-magical-moment drops another order of magnitude beyond what compression and hardware alone deliver.

The bottleneck right now

If I had to name the current bottleneck: it's not model size, and it's not raw hardware. It's harness efficiency and memory. The first is engineering work that's barely been built; the second is the hard problem nobody's fully cracked. A 7B model with a great harness and persistent memory beats a frontier model running stateless on cost-per-completed-task, often by an order of magnitude. The $20/month path runs through software at least as much as through silicon.


Appendix

The rest of this is reference material — the technical context behind the moving pieces above. Skim or skip depending on how deep you want to go.

Pricing of tokens today

Here's how Opus 4.6 pricing breaks down on the Anthropic API.

Base rates (per million tokens)

  • Input: $5.00
  • Output: $25.00

So output costs 5x input — a meaningful asymmetry when you're running agents that produce lots of code or tool calls. A run that consumes 500K input + 100K output tokens would be roughly $2.50 + $2.50 = $5.

Modifiers worth knowing

  1. 1M token context at standard pricing. Opus 4.6 includes the full 1M token context window at standard pricing — earlier reporting suggested a premium tier above 200K, but the current docs price the whole window flat. This matters a lot for agentic workflows where long sessions accumulate context.
  2. Prompt caching. Cache reads are billed at 0.1x the base input rate (a 90% discount on cached input), with cache writes priced separately at a small premium. For agent loops with stable system prompts, tool schemas, or long conversation history, this is the single biggest cost lever.
  3. Batch API. 50% discount on input and output for asynchronous workloads — useful for evaluations, backfills, or anything that can tolerate a delayed SLA. Stacks with caching.
  4. Fast mode (beta). Fast mode for Opus 4.6 provides significantly faster output at 6x standard rates, so $30/M input and $150/M output. Only worth it for latency-sensitive interactive use.
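The rates and modifiers above can be folded into a small calculator. This is a sketch using only the numbers quoted in this section; it ignores the cache-write premium, and it assumes batch and fast mode are not combined, which is my assumption and not something the pricing notes state.

```python
INPUT_PER_M = 5.00       # dollars per million input tokens
OUTPUT_PER_M = 25.00     # dollars per million output tokens
CACHE_READ_MULT = 0.1    # cached input billed at 0.1x
BATCH_MULT = 0.5         # 50% batch discount
FAST_MULT = 6.0          # fast mode at 6x standard rates


def run_cost(input_tokens: int, output_tokens: int,
             cached_input_tokens: int = 0,
             batch: bool = False, fast: bool = False) -> float:
    """Estimated dollar cost of one run (cache-write premium ignored)."""
    if batch and fast:
        raise ValueError("this sketch assumes batch and fast mode are exclusive")
    uncached = input_tokens - cached_input_tokens
    cost = (uncached * INPUT_PER_M
            + cached_input_tokens * INPUT_PER_M * CACHE_READ_MULT
            + output_tokens * OUTPUT_PER_M) / 1_000_000
    if batch:
        cost *= BATCH_MULT
    if fast:
        cost *= FAST_MULT
    return cost


print(run_cost(500_000, 100_000))                               # base example
print(run_cost(500_000, 100_000, cached_input_tokens=400_000))  # mostly cached
print(run_cost(500_000, 100_000, batch=True))
print(run_cost(500_000, 100_000, fast=True))
```

The first call reproduces the $5 example from the base rates; the others show how far the same run moves once most of the input is a cache read or the job can go through the batch path.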

One caveat

There's now a newer Opus 4.7, which Anthropic released as the successor to 4.6. Opus 4.7 keeps the same sticker price as Opus 4.6 ($5/$25), but ships with a new tokenizer that can produce up to 35% more tokens for the same input text — so headline price is identical but effective per-request cost can rise. Worth replaying real traffic before migrating if you're optimizing spend.

Parameter

A neural network is essentially a giant function with lots of tunable knobs. Each knob is a number — a parameter — that gets adjusted during training. The two main kinds are weights (which scale connections between neurons) and biases (which shift outputs). When you hear "GPT-3 has 175 billion parameters," that's the count of all those knobs added together.

You can think of parameters as the model's capacity to store learned patterns. More parameters means more room to encode relationships, facts, and reasoning patterns from training data — but also more compute and memory to use the model.

Counting and naming conventions

  • M = million (e.g., BERT-base is ~110M)
  • B = billion (e.g., LLaMA 7B, 13B, 70B)
  • T = trillion (rumored size of some frontier models)

A "7B model" has about 7 billion parameters. The number is often baked right into the model's name.

Why size matters: memory

Each parameter is just a number stored in memory. The size depends on the precision:

  • FP32 (32-bit float): 4 bytes per parameter
  • FP16 / BF16 (16-bit): 2 bytes per parameter
  • INT8 (quantized): 1 byte per parameter
  • INT4 (heavily quantized): 0.5 bytes per parameter

So a 7B model in FP16 needs roughly 14 GB just to hold the weights — before you account for activations, gradients, or optimizer state during training. This is why GPU VRAM is the main bottleneck for running large models locally. A 70B model in FP16 needs ~140 GB, which is why people quantize aggressively to fit on consumer hardware.
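The arithmetic is just parameter count times bytes per parameter. A small helper, using the precisions listed above and counting weights only:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1e9 bytes).

    Excludes KV cache, activations, and runtime overhead.
    """
    return params_billion * BYTES_PER_PARAM[precision]


for size in (7, 70):
    for precision in ("fp16", "int8", "int4"):
        print(f"{size}B @ {precision}: {weight_memory_gb(size, precision):g} GB")
```

Real deployments need headroom on top of these numbers for the KV cache and runtime buffers, so the practical fit is always somewhat tighter than the weight total suggests.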

Why size matters: capability (scaling laws)

Researchers (notably at OpenAI and DeepMind) found that model performance improves predictably as you scale up parameters, training data, and compute together. This is the famous "scaling laws" result. Doubling parameters tends to give a smooth, measurable improvement in things like prediction loss — no sudden jumps, just steady gains.

The Chinchilla paper (DeepMind, 2022) refined this: for a given compute budget, you should roughly balance parameter count with training tokens. Earlier models like GPT-3 were probably undertrained for their size — you'd get more bang for your buck with a smaller model trained on more data.

Rough size tiers (for language models)

  • Small (< 1B): Run easily on laptops or phones. Good for narrow tasks, classification, embeddings. Examples: DistilBERT, small Llama variants.
  • Medium (1B–10B): Run on a single consumer GPU. Surprisingly capable for chat and coding. Examples: Llama 3 8B, Mistral 7B.
  • Large (10B–100B): Need server-grade GPUs or quantization tricks. General-purpose assistants. Examples: Llama 3 70B.
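  • Frontier (100B+): Served from datacenters in practice, although the low end of this range can run locally with heavy quantization on high-memory machines. The top-tier models from Anthropic, OpenAI, Google, etc.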

Tradeoffs to keep in mind

  • Bigger isn't automatically better. Larger models cost more to train (sometimes tens of millions of dollars) and more to serve (every token generated costs more compute). They also don't necessarily beat smaller, well-trained, well-tuned models on every task — a fine-tuned 7B model can outperform a generic 70B model in its specialty. And for many production uses, latency and cost favor smaller models.
  • There's also a difference between total parameters and active parameters in mixture-of-experts (MoE) models. An MoE might have 200B total parameters but only activate ~20B per token, giving you the quality of a big model with the cost closer to a smaller one.

Mixture of Experts

The core idea

In a regular ("dense") transformer, every parameter is used for every token. If you have a 70B model, all 70B parameters fire on every token you process. This is expensive.

MoE asks: what if we had many specialized sub-networks ("experts"), and for each token, we only used a small handful of them? You get the knowledge capacity of a huge model, but the compute cost of a much smaller one.

Where the experts live

In a standard transformer, each layer has two main components: a self-attention block and a feed-forward network (FFN). The FFN is where most of the parameters live — typically about two-thirds of the model's weights.

MoE replaces that single FFN with many FFNs in parallel — say, 8, 32, or 128 of them. These are the "experts." The attention layers are usually left alone (shared across all tokens). So when people say "MoE model," they almost always mean MoE in the feed-forward layers.

The router

Sitting in front of the experts is a small neural network called the router (or gating network). For each token, the router looks at it and decides: "which experts should handle this one?"

Typically the router picks the top-k experts — most commonly top-2. So out of 8 experts, only 2 actually run for any given token. The token gets sent to those 2 experts, their outputs get combined (weighted by the router's confidence), and that becomes the layer's output.

This is what's called sparse activation: most experts are idle for any given token.
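Top-k routing fits in a few lines of plain Python. The sketch below uses scalar "experts" and one common weighting variant (softmax over the selected logits only); real routers operate on vectors and differ in detail.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized over the chosen experts so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, experts, router_logits, k=2):
    """Run only the chosen experts and mix their outputs."""
    return sum(w * experts[i](token) for i, w in route_top_k(router_logits, k))


# Toy setup: expert i multiplies its input by i. Only 2 of the 4 run.
experts = [lambda x, scale=s: scale * x for s in range(4)]
logits = [0.0, 3.0, 1.0, 3.0]  # hypothetical router output for one token
print(route_top_k(logits))
print(moe_layer(2.0, experts, logits))
```

Experts 0 and 2 are never called for this token, which is the whole saving: the cost of the layer scales with k, not with the number of experts.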

Total vs active parameters

This is the key number to understand with MoE. Take Mixtral 8x7B as a classic example:

  • 8 experts per layer. The name suggests 8 × 7B, but only the FFN blocks are replicated, so each expert is smaller than a full 7B model
  • Total parameters: ~47B (not 56B, because attention layers are shared)
  • Active parameters per token: ~13B (2 experts fire, plus shared attention)

So Mixtral 8x7B has the memory footprint of a 47B model — you need to keep all experts loaded in VRAM because you don't know in advance which ones the router will pick — but the compute cost of a 13B model. That's the magic trick: cheap inference, big knowledge capacity.
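The total and active figures can be roughly reproduced from the architecture. The config values below are what I understand the published Mixtral 8x7B configuration to be; treat them as assumptions. Layer norms are ignored because they are negligible at this scale.

```python
# Assumed config values for Mixtral 8x7B (not taken from this post).
HIDDEN = 4096      # model width
FFN_DIM = 14336    # expert feed-forward width
LAYERS = 32
HEADS = 32
KV_HEADS = 8       # grouped-query attention
VOCAB = 32000
EXPERTS = 8
TOP_K = 2

head_dim = HIDDEN // HEADS
expert_ffn = 3 * HIDDEN * FFN_DIM                 # gate, up, and down projections
attention = (2 * HIDDEN * HIDDEN                  # query and output projections
             + 2 * HIDDEN * KV_HEADS * head_dim)  # key and value projections
router = HIDDEN * EXPERTS
embeddings = 2 * VOCAB * HIDDEN                   # input embedding + output head


def param_count(experts_counted: int) -> int:
    """Parameters when `experts_counted` experts per layer are included."""
    per_layer = attention + router + experts_counted * expert_ffn
    return LAYERS * per_layer + embeddings


total = param_count(EXPERTS)
active = param_count(TOP_K)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
print(f"one expert across all layers: {LAYERS * expert_ffn / 1e9:.1f}B")
```

Under these assumptions the split is easy to see: attention, router, and embeddings are shared and counted once, while the expert FFNs are counted eight times for memory and twice for compute.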

DeepSeek-V3 takes this further: 671B total parameters, but only ~37B active per token.

What do experts actually specialize in?

This is where intuition often misleads people. You might imagine "expert 1 = math, expert 2 = poetry, expert 3 = code." In practice, the specializations are much weirder and more low-level. Researchers have found experts specializing in things like:

  • Specific punctuation patterns
  • Certain syntactic structures
  • Particular languages or scripts
  • Numbers vs. words

The specializations emerge from training rather than being designed. They're often not human-interpretable.

The hard part: load balancing

A naive MoE has a serious problem: the router might learn to always send tokens to the same one or two favorite experts, leaving the rest untrained and useless. This is called expert collapse.

To prevent it, training adds an auxiliary loss that penalizes uneven expert usage, encouraging the router to spread tokens out. Getting this balance right — without hurting model quality — is one of the trickier engineering problems in MoE training. DeepSeek-V3 introduced an "auxiliary-loss-free" approach that uses a bias term instead.
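One well-known form of the auxiliary loss is the Switch Transformer style: number of experts times the sum over experts of (fraction of tokens routed there) times (mean router probability for that expert). The sketch below implements that form for top-1 routing without the usual scaling coefficient; it is an illustration of the idea, not the loss any specific model above uses.

```python
def load_balance_loss(router_probs):
    """Switch-style balance loss for top-1 routing.

    router_probs: one probability list per token, each summing to 1.
    Returns about 1.0 when load is spread evenly and grows as routing
    concentrates on a few experts.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    token_fraction = [0.0] * n_experts  # share of tokens sent to each expert
    mean_prob = [0.0] * n_experts       # average router probability per expert
    for probs in router_probs:
        top = max(range(n_experts), key=probs.__getitem__)
        token_fraction[top] += 1 / n_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(token_fraction, mean_prob))


balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4  # every token prefers expert 0

print(round(load_balance_loss(balanced), 3))
print(round(load_balance_loss(collapsed), 3))
```

The collapsed router scores well above the balanced one, so adding this term to the training loss pushes the router toward spreading tokens out.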

Tradeoffs

The wins:

  • Much cheaper inference per token vs. a dense model of equivalent quality
  • Much cheaper training compute for the same capability
  • Scales gracefully — adding more experts adds capacity without adding much per-token compute

The costs:

  • Memory doesn't shrink. You need VRAM for all experts even though only some run.
  • More complex to train (load balancing, routing instability)
  • Harder to serve efficiently — different tokens in a batch want different experts, which makes GPU utilization tricky
  • Communication overhead when experts are spread across multiple GPUs

Why it matters now

Most frontier models from the last couple of years are believed to be MoE: GPT-4, Gemini, Claude (probably), DeepSeek-V3, Mixtral, Grok. The dense-vs-sparse tradeoff has shifted hard toward sparse for the largest models, because at frontier scale the inference cost savings are enormous.

For smaller models (under ~10B), dense is usually still preferred — the MoE complexity isn't worth it when the model is small enough to be cheap anyway.


Training software

The software stack for training is a fundamentally different problem from serving. Training is bottlenecked by memory across many GPUs at once, and by the time it takes to synchronize gradients between them.

Frameworks

  • PyTorch — Dominant. Most research and most production training happens here.
  • JAX — Google's functional framework. Used for Gemini. Strong composition story (pmap, vmap, jit).
  • TensorFlow — Mostly legacy at this point for new LLM work.

Distributed training systems

  • DeepSpeed (Microsoft) — ZeRO optimizer shards optimizer state, gradients, and parameters across GPUs to fit massive models in memory. CPU offloading for even bigger ones.
  • FSDP (Fully Sharded Data Parallel) — Native PyTorch equivalent of ZeRO. The default modern choice if you're already in PyTorch.
  • Megatron-LM (NVIDIA) — Tensor and pipeline parallelism for the very largest models. The reference implementation for splitting a single layer across many GPUs.
  • Colossal-AI — Open-source distributed training, easier ergonomics than the above.

The three flavors of parallelism

Why so much complexity? Because a 70B model in FP16 takes ~140 GB just for weights, plus another 140 GB of gradients and ~560 GB of Adam optimizer state (two FP32 moments per parameter). That is already 840 GB, and FP32 master weights and activations push the total well over 1 TB. No single GPU holds this, so you split it.

  • Data parallelism: Same model on every GPU, different batch on each. Synchronize gradients after each step.
  • Tensor parallelism: Split each layer across GPUs. A single matmul becomes a coordinated operation across the group (Megatron-style).
  • Pipeline parallelism: Put different layers on different GPUs. Batches flow through like an assembly line.

Most large training runs use all three combined — "3D parallelism." Frontier runs add expert parallelism for MoE on top.
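The memory estimate that motivates all this splitting can be written down directly. A sketch assuming FP16 weights and gradients and two FP32 Adam moments per parameter; the optional FP32 master copy is a common mixed-precision choice that I am adding as an assumption, and activations are left out because they depend on batch size and sequence length.

```python
def training_memory_gb(params_billion: float,
                       weight_bytes: float = 2,       # FP16 weights
                       grad_bytes: float = 2,         # FP16 gradients
                       optimizer_bytes: float = 8,    # two FP32 Adam moments
                       master_weight_bytes: float = 0) -> dict:
    """Per-component training memory in GB, excluding activations."""
    parts = {
        "weights": params_billion * weight_bytes,
        "gradients": params_billion * grad_bytes,
        "optimizer": params_billion * optimizer_bytes,
        "master_weights": params_billion * master_weight_bytes,
    }
    parts["total"] = sum(parts.values())
    return parts


print(training_memory_gb(70))
print(training_memory_gb(70, master_weight_bytes=4)["total"])

# Spread evenly over 64 GPUs by sharding all state (ZeRO / FSDP style)
print(training_memory_gb(70, master_weight_bytes=4)["total"] / 64)
```

The last line is the argument for sharding in one number: state that cannot fit on any single device becomes a manageable per-GPU slice once it is divided across the group.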


Inference Software — how do you run these things efficiently?

What inference software actually does

When you send a prompt to a model, a lot happens between "text in" and "text out": tokenization, loading weights into GPU memory, running the forward pass for each generated token, sampling, detokenization. Inference software handles all of this — and the difference between a naive implementation and a well-optimized one can be 10–100x in throughput.

The core challenge: LLM inference, and token-by-token generation in particular, is memory-bandwidth bound rather than compute-bound. Modern GPUs can do trillions of math operations per second, but they spend most of their time waiting to read weights from VRAM. Almost every optimization is, at its heart, about moving less data or reusing it more cleverly.

The two phases of inference

Understanding this split is essential for everything else:

  • Prefill is processing the prompt. All input tokens are processed in parallel in a single forward pass built from large matrix multiplications. This is compute-bound and fast per token.
  • Decode is generating output, one token at a time. Each new token needs to attend to all previous tokens. This is memory-bound and much slower per token.

This asymmetry — prompts are cheap, generation is expensive — drives most of the engineering.

KV cache: the central data structure

In a transformer, attention requires the keys and values for every previous token. Recomputing them every step would be wildly wasteful, so they're cached — this is the KV cache.

The KV cache grows linearly with sequence length and dominates memory during long generations. For a 70B model, the KV cache for a single 8K-token conversation can be several GB. Managing this cache efficiently is arguably the central problem in inference serving.
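The "several GB" figure follows from a short formula: two tensors (keys and values) per layer, per KV head, per token. The config below is an assumed Llama-style 70B layout with grouped-query attention, not a number taken from this post.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence (keys + values, all layers)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Assumed 70B-class config: 80 layers, head_dim 128, FP16 cache.
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=8192)

print(f"8 KV heads (grouped-query): {gqa / 1e9:.2f} GB")
print(f"64 KV heads (full multi-head): {mha / 1e9:.2f} GB")
```

The linear growth in `seq_len` is what makes long agentic contexts expensive to serve, and the gap between the two lines shows why grouped-query attention became standard for large models.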

Key optimization techniques

Continuous batching (sometimes called in-flight batching): The naive approach batches requests together, but they finish at different times — the whole batch waits for the slowest one. Continuous batching swaps finished sequences out and new ones in at every step, keeping the GPU full. This was popularized by vLLM and is now standard.
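A toy simulation shows the effect. Each request needs some number of decode steps, and the GPU can hold a fixed number of sequences at once. This sketch counts steps only and ignores prefill, memory limits, and scheduling overhead; the request lengths are invented.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled before the next step."""
    queue = deque(lengths)
    slots = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())
        steps += 1  # one decode step produces one token per active slot
        slots = [remaining - 1 for remaining in slots if remaining > 1]
    return steps


# Hypothetical output lengths: a few long generations among short ones
requests = [10, 200, 15, 20, 180, 12, 25, 30]
print(static_batching_steps(requests, batch_size=4))
print(continuous_batching_steps(requests, batch_size=4))
```

In the static case the short requests sit idle behind the 200-token one; in the continuous case their slots are reused immediately, so the second long request overlaps with the first.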

PagedAttention: vLLM's signature contribution. Instead of allocating one big contiguous KV cache per request, it splits the cache into fixed-size pages (like virtual memory in operating systems). This nearly eliminates memory fragmentation and lets you fit many more concurrent requests in the same VRAM.

Speculative decoding: A small "draft" model generates several candidate tokens quickly, then the big model verifies them all in one parallel pass. If the draft was right, you got multiple tokens for the cost of one big-model forward pass. Common speedup: 2-3x.
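Here is the greedy variant of that loop with toy stand-ins for the two models. Everything here is illustrative: the "models" are arithmetic functions, and the verification is written as a loop for clarity where a real system would score all draft positions in one batched forward pass. Sampling-based decoding uses a rejection-sampling rule in place of the equality check.

```python
def speculative_step(context, draft_next, target_next, k=4):
    """One round: draft proposes k tokens, target keeps the longest
    agreeing prefix and then contributes one token of its own."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the round
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token: every draft was right
    return accepted


def generate(prompt, draft_next, target_next, n_tokens, k=4):
    """Generate n_tokens; returns (tokens, number of target rounds)."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


# Toy "models": the draft agrees with the target except after token 4.
def target_next(ctx):
    return (3 * ctx[-1] + 1) % 11


def draft_next(ctx):
    return 0 if ctx[-1] == 4 else target_next(ctx)


tokens, rounds = generate([1], draft_next, target_next, n_tokens=12)
print(tokens, rounds)
```

The output is exactly what the target would have produced on its own; only the number of target rounds changes. How much that saves in practice depends on how often the draft agrees and how cheap the draft model is.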

Quantization at inference time: Running the model in INT8, INT4, or even lower precision. Less data to move = faster. Methods like GPTQ and AWQ, along with the quantization schemes shipped in llama.cpp's GGUF format, have made 4-bit inference shockingly good, often within 1-2% of full-precision quality.

FlashAttention: A reimplementation of the attention operation that's mathematically identical but uses GPU memory hierarchy more cleverly. Pretty much everyone uses it now. FlashAttention-2 and -3 added further refinements.

Prefix caching: If many requests share a common prefix (like a system prompt), cache the KV state for that prefix once and reuse it across all requests. Huge win for chat applications.

The major inference engines

vLLM — Open source, originated at Berkeley. Probably the most popular self-hosted option. Strong on continuous batching and PagedAttention. Supports most popular model architectures.

TensorRT-LLM — NVIDIA's offering. Squeezes the most performance out of NVIDIA hardware specifically. More setup work, but often the fastest option if you're locked into NVIDIA.

SGLang — Newer, from LMSYS. Particularly strong at structured generation and complex prompting patterns. Often benchmarks competitively with vLLM.

TGI (Text Generation Inference) — Hugging Face's serving framework. Production-grade, integrates well with the HF ecosystem.

llama.cpp — The CPU/local-first option. Written in C++, runs on basically anything — laptops, phones, Raspberry Pis. The GGUF format is its quantized model format. Not the fastest in absolute terms, but unmatched for portability.

Ollama — A wrapper around llama.cpp that makes local model running ridiculously easy. ollama run llama3 and you're done. Great for developers experimenting locally.

MLX — Apple's framework for running models on Apple Silicon. Takes advantage of unified memory architecture, which is genuinely good for LLM inference (no CPU↔GPU transfer).

ExLlamaV2 — Hyper-optimized for running quantized models on consumer NVIDIA GPUs. Popular in the local-LLM enthusiast scene.


Variants of silicon

Training silicon vs inference silicon

These have very different requirements, and the hardware is starting to bifurcate accordingly.

Training silicon needs:

  • Massive memory per chip (HBM, 80–192 GB)
  • Very fast inter-chip interconnect (NVLink, InfiniBand) — gradients need to sync across the whole pod every step
  • High BF16/FP16 throughput
  • Reliability over week-long runs

Inference silicon can specialize differently:

  • Lower precision is fine (INT8, INT4, FP8)
  • Smaller per-chip memory if the model fits
  • Focus on tokens/sec/dollar and first-token latency
  • Power efficiency matters more (you serve forever; you train once)

NVIDIA datacenter GPUs

  • H100 (Hopper) — Current workhorse. 80 GB HBM3, ~3 TB/s memory bandwidth. Most frontier training and serving runs on these.
  • H200 — H100 with more memory (141 GB HBM3e). Same compute, much better for inference because larger KV caches fit.
  • B100 / B200 (Blackwell) — Next gen. Big jump in compute, larger NVLink domains (so you can treat more GPUs as one logical unit).
  • A100 (Ampere) — Previous gen, still everywhere. Cheaper on the secondary market.

Consumer NVIDIA

  • RTX 4090 — 24 GB VRAM, ~1 TB/s. The de facto enthusiast LLM card.
  • RTX 5090 — 32 GB VRAM, faster memory. New flagship, hard to find at MSRP.
  • RTX 3090 — 24 GB, slower than 4090 but cheap on the used market. Popular base for multi-GPU rigs.

Apple Silicon

The interesting wildcard for local inference. Apple's chips use unified memory: the CPU and GPU share the same RAM directly, with no PCIe transfer between them.

  • A Mac Studio with an M2 Ultra can be configured with up to 192 GB of unified memory; the M3 Ultra raises the ceiling to 512 GB.
  • Memory bandwidth tops out around 800 GB/s on Ultra — less than dedicated datacenter GPUs, but plenty for inference.
  • Best ratio of "VRAM per dollar" you can buy retail. A maxed-out Mac Studio gets you running 70B-class models comfortably and 100B+ with quantization.

AMD

  • MI300 / MI325 — Competitive datacenter chips, large memory (192 GB on MI300X). Software ecosystem (ROCm) is improving but still trails CUDA.
  • Radeon consumer cards — Work for inference via llama.cpp, but tooling is consistently a step behind NVIDIA.

Specialized inference chips

  • Groq — LPU (Language Processing Unit). Insanely fast token generation through deterministic, compiler-scheduled architecture. SRAM-only design — works for smaller models, doesn't scale to frontier sizes per chip.
  • Cerebras — Wafer-scale chip. Whole model lives on a single piece of silicon, sidestepping interconnect issues.
  • Etched, SambaNova, Tenstorrent — Various startup approaches; the field hasn't consolidated yet.

Google TPUs

  • v5e, v5p, and the newer Trillium (v6) generations. Tightly integrated with JAX. Used internally for Gemini training and serving. Available externally on GCP.

What can you actually run on consumer hardware?

The whole point of all of the above. Practical tiers for local inference (assuming 4-bit quantization, which is the sweet spot for quality vs memory):

Phone / Raspberry Pi (~4–8 GB RAM)

  • Tiny models: Phi-3.5-mini (3.8B), Llama 3.2 1B/3B, Gemma 2B
  • Useful for narrow tasks, classification, on-device summarization
  • 5–30 tokens/sec depending on chip

Laptop CPU / 16 GB RAM

  • 7–14B models in 4-bit (Llama 3 8B, Mistral 7B, Qwen 2.5 14B)
  • 5–15 tokens/sec via llama.cpp or Ollama
  • Genuinely useful for chat, coding assistance, drafting

Single consumer GPU (RTX 4090, 24 GB VRAM)

  • Up to ~30B-class models in 4-bit (e.g., Qwen 2.5 32B); a 70B model needs ~35 GB at 4-bit, so it fits only with ~2-bit quantization or partial CPU offload
  • Or 13B-class models in 8-bit with full context
  • 30–100 tokens/sec
  • Sweet spot for serious local work

Mac Studio M2/M3 Ultra (96–192 GB unified memory, up to 512 GB on M3 Ultra)

  • 70B–120B+ models in 4-bit with comfortable headroom
  • Mixtral 8x22B, DeepSeek-V3 distillations, even Llama 3.1 405B at heavy quantization
  • 10–40 tokens/sec depending on model size
  • Often the easiest path to running very large models at home — quietly, without a server rack

Multi-GPU rig (2–4x RTX 3090 / 4090)

  • 70–120B models with room for context
  • Higher raw throughput than a Mac for similar quality
  • Real cost: power, noise, heat, and you're now a sysadmin

The sweet spot for agentic workloads — running tool-using models locally — is currently a 70B-class model on either a 4090 with aggressive quantization or a Mac Studio with plenty of unified memory. You give up some quality vs frontier API models, but you gain privacy, no per-token cost, and the ability to run continuously on something sitting on your desk.