If you’ve never run a model locally and you’re wondering whether your hardware can handle it – good news, the barrier to entry is way lower than you’d think. A £200 used GPU from eBay handles models that genuinely surprise people. Thousands of pounds? Not required. But you do need to know which single number on the spec sheet actually matters, because most of them really don’t.
About eighteen months ago I bought a used RTX 3090, mostly because I was tired of paying API costs every time I wanted to experiment with a model. I’d pull a 13B model, chat with it, try a bigger one, hit the VRAM wall, and immediately start thinking about a second card. The 3090 turned into a Threadripper 5990x workstation with six NVIDIA Ada GPUs and 104GB of VRAM – it sits in my office, runs all day, and I’ve even built an MCP for Claude Code to work with my local LLM (running LM Studio).
In today’s article, I’m going to walk through the hardware mistakes and lucky finds – from a £200 GPU shoved into my work PC to mini PCs with 128GB of unified memory running 70B models from under a desk.
Quick Navigation
Jump directly to what you’re looking for:
VRAM & Quantisation | Hardware Secrets | Budget (Under £800) | RTX 3060 Benchmarks | Mini PCs for AI | Mid-Range (£800-£3,000) | Premium (£3,000+) | Does the CPU Matter? | Software Stack | Comparison Table
The Number That Matters
Something I’ve learned building these rigs: when it comes to PC workstations, VRAM on the GPU decides everything. Not clock speed, not CUDA cores, not the number NVIDIA puts on the box. If a model fits in your GPU’s memory, it runs fast enough. If it doesn’t, the remainder spills into your main system RAM and you’re suddenly at two tokens per second – it still works, but it’s slooow!
People overcomplicate this, but the arithmetic is simple. Every parameter in your model has to sit in memory somewhere – you can’t get around that. At full precision (FP16), one parameter costs you 2 bytes, so a 70-billion-parameter model needs 140GB. No consumer GPU on the planet has that kind of VRAM yet – even three of those modified 48GB “4090D” cards you see on eBay would only just cover it, and would probably melt trying. (Yes, those cards are real: rework shops take a 4090D, swap the GPU chip onto a new board and fit denser memory – reportedly the boards come out of the same factories as NVIDIA’s own. A lot less sketchy than it sounds, and I am so, so tempted.)
Quantisation fixes this. Compress those billions of parameters down to 4-bit (Q4_K_M is the format you’ll see everywhere) and each one drops to roughly half a byte. That 70B model shrinks from 140GB to about 40GB – two used RTX 3090s, with room left over for context window overhead. I’ve been running Qwen 3 Coder Next at Q6 quantisation on my own rig for a couple of months now and can’t feel any quality difference from full precision on the tasks I throw at it. I wrote up the whole process in my LM Studio setup guide if you want to try the same thing.
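If you want to sanity-check a purchase before clicking buy, the maths is simple enough to script. Here’s a minimal sketch in Python – the bits-per-weight figures are rough community numbers (Q4_K_M averages a bit over 4 bits once you count block scales), not gospel:

```python
# Back-of-envelope VRAM for model weights: parameters x bits-per-weight / 8.
# Ignores KV cache and runtime overhead, which come on top.

def model_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    print(f"70B @ {label:7}: ~{model_vram_gb(70, bits):.0f} GB")

# 70B @ FP16   : ~140 GB
# 70B @ Q8_0   : ~74 GB
# 70B @ Q4_K_M : ~39 GB
```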
One thing that’s changed since I first wrote this article: Llama 4 Scout landed with 109B total parameters in a Mixture-of-Experts architecture. Only 17B parameters are active at any time, but MoE models need all parameters loaded in memory. That means ~55-70GB of VRAM just to load it at INT4. A single RTX 4090 can’t touch it. This is pushing people toward either high-RAM unified memory machines (the mini PCs below) or multi-GPU rigs. The VRAM arms race isn’t slowing down.
Quick VRAM Guide
So what sort of model can your VRAM actually run? One warning before the table: the RTX 3080 shipped in both 10GB and 12GB versions, so check which one you’re getting if you’re buying second-hand on eBay.
| Model Size | At Q4 (4-bit) | At Q8 (8-bit) | Good GPU Fit |
|---|---|---|---|
| 7B | 6-8 GB | 10-12 GB | RTX 3060 12GB, RX 9060 XT 16GB |
| 13-14B | 10-12 GB | 16-18 GB | RTX 3060 12GB, RX 9060 XT 16GB, 4060 Ti 16GB |
| 34B | 20-24 GB | 30+ GB | RTX 3090 or 4090 (24GB) |
| 70B | ~40 GB | ~75 GB | Dual 3090s, Mac Studio, RTX 5090, or 128GB mini PC |
| 109B MoE (Llama 4 Scout) | ~55-70 GB | ~110+ GB | 128GB unified memory (Framework/GMKtec/DGX Spark) |
| 100B+ dense | 60-70 GB | 100+ GB | Quad 3090s, M3 Ultra 192GB |
Don’t forget KV cache on top of this – it stores your conversation state and grows with context length. At 32k tokens, budget for another 2-4GB, which caught me out the first time I tried to squeeze a 34B onto a 24GB card.
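You can estimate that KV overhead yourself. The architecture numbers below are assumptions for a Llama-3-70B-style model (80 layers, 8 KV heads via grouped-query attention, head dimension 128) – check your model’s card, because these vary a lot between families:

```python
# KV cache: 2 tensors (K and V) per layer, n_kv_heads * head_dim values
# stored per token per layer.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

print(f"FP16 cache @ 32k: ~{kv_cache_gb(80, 8, 128, 32768):.1f} GB")
print(f"Q8 cache   @ 32k: ~{kv_cache_gb(80, 8, 128, 32768, 1):.1f} GB")
# ~10.7 GB and ~5.4 GB for a 70B-class model at 32k context; smaller models
# and quantised caches are how you land in the 2-4 GB range mentioned above.
```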

Local LLM Hardware Secrets
Here are three bits of hardware wisdom I’ve had to learn the expensive way.
One Fast Card Beats Two Slow Ones
Tempting maths: two RTX 3060 12GB cards give you 24GB total – the same VRAM as a single 3090. Same capacity, completely different speed. I made a version of this mistake myself: I bought an array of bargain Ada-generation RTX 4000s and 4500s, and both the mix of cards and the sheer number of them hurt. It runs, but I reckon I’m losing at least 20% of the performance purely because of all the PCIe lanes in play.
Digital Spaceport did the numbers. A single 3090 hits 28 tokens per second on Gemma 3 27B at Q4. The dual 3060 setup? Six. On the exact same model. Splitting a model across GPUs over PCIe – sharding, they call it – kills throughput because the cards spend more time talking to each other than doing inference. I really, really wish I’d understood this before I sold the 3090s from my GPU mining days.
So when does multi-GPU work? When the model needs both cards anyway. Two 3090s running a 70B model that requires 48GB of VRAM is fine – fifteen to twenty tok/s with NVLink. But don’t buy two cheap cards hoping they’ll match one expensive one. They won’t. Don’t mix generations, don’t mix VRAM sizes – and most consumer “gaming” motherboards won’t run a full x16 PCIe link on more than one slot. Simple is actually the best approach.
NVLink for Multi-GPU Fine Tuning
Something I genuinely didn’t expect when I added my second GPU: the connection between the cards ends up mattering almost as much as the cards themselves in specific use cases. What NVLink does is give the GPUs their own private highway – 112.5 GB/s bidirectional. Compare that with regular PCIe 4.0 x8, which tops out around 16 GB/s. About seven times slower, and you absolutely notice it in practice.
The caveat: NVLink mostly pays off in fine-tuning, not in inference (chat!) – oh well.
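Some rough arithmetic shows why. With a model split across two cards, inference only pushes the hidden state for each new token over the link, while fine-tuning has to move gradients every optimiser step. The hidden size and gradient volume below are illustrative assumptions, not measurements:

```python
# Rough traffic comparison for a model split across two GPUs. Inference moves
# one hidden-state vector per generated token; fine-tuning synchronises
# gradients every step. All figures are illustrative assumptions.

hidden_size = 8192                  # assumed 70B-class hidden dimension
per_token = hidden_size * 2         # FP16 bytes per token: ~16 KB

pcie, nvlink = 16e9, 112.5e9        # bytes/sec, from the figures above

print(f"Inference: {per_token/1e3:.0f} KB/token "
      f"-> {per_token/pcie*1e6:.1f} us over PCIe x8")

grad_bytes = 0.5e9                  # assumed ~0.5 GB of gradients per step
print(f"Fine-tune: {grad_bytes/pcie*1e3:.0f} ms/step over PCIe "
      f"vs {grad_bytes/nvlink*1e3:.0f} ms over NVLink")
# Inference traffic is tiny either way; fine-tuning is where the fat pipe pays.
```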
Apple Silicon: Capacity Over Speed
A Mac Studio M3 Ultra with 192GB of unified memory can load models that would need four discrete NVIDIA GPUs on a PC. All that RAM is GPU-accessible. No PCIe bottleneck, no sharding penalty. Near-silent, too, which matters if (like me) you’re working in the same room as the hardware.
Speed-wise, NVIDIA is quicker – about 2-3x on models that fit in its VRAM. A dual 3090 PC does 15-20 tok/s on 70B; the M3 Ultra manages 8-12 tok/s on the same model. Where the Mac pulls ahead is models above 100B parameters that the PC can’t touch without a quad-GPU build, and frankly, for research tasks where you’re running huge models rather than chatting interactively, it makes more sense than people give it credit for.
Unified Memory Changes Everything
So here’s what actually changed in 2026. Apple proved the concept with Apple Silicon years ago, but now AMD’s Strix Halo chips bring the same unified memory architecture to Windows PCs. The Framework Desktop, Corsair AI Workstation 300, and GMKtec EVO-X2 all pack 128GB of shared memory that the integrated GPU can access directly. No PCIe bus, no sharding. You load a 70B model into memory and the GPU just… uses it.
The trade-off is speed. These integrated GPUs are slower than a dedicated RTX card on models that fit in discrete VRAM. But for models that don’t fit – 70B, Llama 4 Scout, anything MoE – unified memory machines are the only option under £3,000 that doesn’t involve multiple GPUs and a wiring diagram. I got into all of this in my beginner’s guide to AI mini PCs and the DGX Spark – worth reading if the unified memory thing is new to you.
Budget: Under £800
RTX 3060 12GB
The cheapest way into serious local AI. Twelve gigabytes of VRAM at 170W TDP, running 7B models at Q8 or 13B models at Q4 – Llama 3, Mistral, Phi-3, all the capable smaller models that have come out this past year.
Why this over a newer RTX 4060? Because the 4060 only ships with 8GB of VRAM, and for AI work, 12GB from an older generation beats 8GB from a newer one every single time – pretty much consensus in the local LLM community at this point. Pick one up for about £200 on Amazon, pair it with a used HP Z440 workstation off eBay (about £100) and you’ve got a complete AI rig for under £350. System idles at around 65 watts.
RTX 3060 12GB: Actual Benchmarks
People keep searching for specific numbers on this card, so here they are. These are community benchmarks from Hardware Corner and Digital Spaceport, confirmed against my own testing where I could:
| Model | Quantisation | Tokens/sec | Context |
|---|---|---|---|
| Llama 3 8B | Q4_K_M | ~42-50 | 4k-16k |
| Mistral 7B | Q4_K_M | ~40-50 | 4k-16k |
| Qwen 2.5 14B | Q4_K_M | ~22-23 | 16k |
| Qwen 2.5 14B | 5-bit EXL2 | ~30-33 | 8k |
| Phi-3 14B | Q4_K_M | ~22-25 | 16k |
| Any 20B+ model | Q4 | ~9 | Limited |
The 14B sweet spot is genuinely impressive for a £200 card. Thirty tokens per second on Qwen 2.5 14B at 5-bit EXL2 – honestly, at that speed I forget I’m running it locally rather than hitting an API. The hard ceiling hits around 20B parameters, and after that you’re spilling into system RAM and it drops to single digits. Still usable if you’re batching things overnight, but not for interactive chat.
If you’re running ExLlamaV2 (which you should be for GPU-only inference on NVIDIA), the 3060’s 360 GB/s of memory bandwidth actually beats the RTX 4060’s 272 GB/s, and token generation speeds follow suit. Newer architecture doesn’t matter when you simply don’t have enough VRAM.
RX 9060 XT 16GB (AMD’s New Budget Option)
Worth knowing about if you’re buying new in 2026. Sixteen gigabytes of GDDR6 on AMD’s RDNA 4 architecture for about £300. Four extra gig over the RTX 3060 means 14B models at Q8 fit comfortably, and you can squeeze 24B models in at aggressive quantisation.
The catch is software. AMD’s ROCm stack for AI workloads has improved massively over the past year, but it’s still behind CUDA in terms of compatibility and community support. Most local LLM tools work – Ollama, llama.cpp, LM Studio all support AMD now – but you’ll hit more edge cases than you would on NVIDIA. If you’re comfortable troubleshooting, the extra VRAM is worth it. If you want everything to just work first time, the RTX 3060 is still the safer buy.
RTX 3090 24GB (Used/Renewed)
Yeah, it’s two generations old. The local AI community collectively shrugged at that ages ago and kept buying them.
Twenty-four gigabytes of GDDR6X handles 34B models at Q4 or 70B at tight quantisation. I ran one of these for about a year before the Threadripper build happened, and looking back I’m slightly embarrassed at how long I underestimated what a single 24GB card could handle. Community benchmarks from Digital Spaceport show 28-36 tok/s on 14B models, 28 tok/s on Gemma 3 27B Q4. Nothing under a grand comes close to that combination of capacity and speed – and it’s got NVLink support for when you inevitably want to add a second one.
Renewed cards run £650-800 on Amazon. Most sellers give you about 90 days of warranty. Bit of a gamble, but I’ve not heard of widespread failure rates from the AI community. If you’re planning multi-GPU later, look for blower-style cards – they exhaust heat out the back instead of dumping it onto the card above. The 350W TDP per card adds up fast when you’ve got two of them in the same case.
Mini PCs for AI
This is the section that didn’t exist when I first wrote this article, and it’s probably the most important update. The whole landscape shifted when AMD shipped Strix Halo – a laptop-class chip with 128GB of unified memory that the integrated GPU can access directly. Suddenly you can run 70B models from a box that fits on a shelf and draws 120W. No discrete GPU needed.
I covered the technology in depth in my beginner’s guide to AI mini PCs and the DGX Spark, but here’s the practical buying guide.
Framework Desktop (128GB)
Honestly? If I were starting from scratch today this is probably where my money would go. AMD Ryzen AI Max+ 395, 128GB LPDDR5X unified memory, crammed into a 4.5-litre case that Framework co-designed with Cooler Master and Noctua. And because it’s Framework, the whole thing is modular – you can swap the front panel tiles, the fans, even 3D print custom bits.
96GB of that 128GB is allocatable to the GPU. Runs 70B models. Llama 4 Scout fits (just). Near-silent under inference load.
The catch: LPDDR5X prices have gone through the roof. Framework originally priced the 128GB model at $1,999 but it’s now up to around $2,459 (~£1,970) due to memory supply constraints. Still the cheapest 128GB unified memory machine you can buy, and the modular design means you’re not throwing the whole thing away when the next generation of chip arrives.
Corsair AI Workstation 300
Corsair’s answer to the same question. Same Strix Halo chip, same 128GB of LPDDR5X, but in Corsair’s own compact 4.4-litre chassis with a 300W Flex ATX PSU. Shipping now at around $2,500 (~£2,000).
Less modular than Framework, but Corsair’s build quality and cooling are proven. If you already trust Corsair hardware (and half the PC gaming community does), this is the path of least resistance into unified memory AI.
GMKtec EVO-X2
The surprise entry that started this whole mini PC category. Same AMD Ryzen AI Max+ 395, same 128GB option, slightly different cooling approach. Around £2,000-2,500 on Amazon.
It was the first to market and the early reviews are solid. Speed won’t match a 3090, mind you – somewhere around 10-15 tok/s on 27B models from what I’ve seen. For an always-on inference box that handles 70B from under your desk without waking the house though, I haven’t found anything else in this bracket. Plus I expect to see gen 1 Strix Halo mini PCs on eBay for £500-600 in two years’ time once the next chip generation lands.
Mid-Range: £800 – £3,000
Mac Mini M4 Pro (24GB)
Apple’s cheapest route into unified memory for AI work. Twenty-four gig of unified memory, which handles 13-14B models nicely through MLX. Slower on raw tok/s than a 3090, but the software side is painless – Ollama runs natively, no CUDA drivers to wrestle with. £1,399 on Amazon.
Not going to touch 70B, not remotely. But for 7-14B work – coding assistants, summarisation, local chatbots – it’s a genuinely lovely, quiet machine that does exactly what you’d want. If you’re on macOS already and want to dip a toe into local inference, this is probably where I’d point you first.
RTX 4000 Ada (Workstation, 20GB)
I run these in my own rig and they’ve been brilliant. Single-slot form factor at 130W per card, twenty gig of VRAM each. Stick four of them in a standard workstation case and you’re sitting on 80GB total at 520W combined – more than enough for 70B models at Q5 with headroom left over for context windows.
I’ve got six in my Threadripper 5990x (mixed with RTX 4500 Adas) for 104GB total. Quiet enough to sit in my office all day, which was the main engineering constraint because I’m working next to it eight hours a day. The whole system pulls about 800W under full inference load – sounds like a lot until you compare it with a quad 3090 setup drawing 1,400W. Raw tok/s per card is lower than gaming GPUs, but the density and power efficiency are what sold me for a machine that runs continuously. About £1,150 each on Amazon.
Dual RTX 3090 Build
The prosumer sweet spot for people who want 70B models on NVIDIA hardware. Two 3090s together give you 48GB of total VRAM. Bridge them with NVLink and you’re looking at 15-20 tok/s on 70B Q4. Skip the bridge and it drops to 10-14 tok/s, which sounds bad until you actually try it – still plenty fast enough to hold a conversation with a model.
Build essentials: the pair of cards will set you back £1,300-1,500 used. The PSU situation gets interesting because each card wants 350W under load, so budget for a 1,200-1,600W unit. For the platform, Threadripper or HEDT gives you full x16/x16 PCIe bandwidth – consumer boards like Z790 or X670E split to x8/x8, which works but costs some throughput. An NVLink bridge runs about £40-60 used. Whole thing comes in at £1,800-2,200 depending on your platform choice, and no, nobody sells this as a pre-built – you’re getting your hands dirty.
Premium: £3,000+
RTX 5090 (32GB)
The biggest single card you can walk into a shop and buy. Thirty-two gigabytes of GDDR7, 512-bit bus, Blackwell architecture – and for the first time, a quantised 70B model actually fits on one card. No sharding, no NVLink, no dual-GPU headaches. One slot, done.
Bad news on pricing, though. The £1,799 MSRP is a fantasy at this point – GDDR7 supply constraints and AI demand have pushed real UK street prices to £2,899 for the cheapest models (Zotac Solid from Overclockers UK) and up to £3,500-4,400 for premium cards from ASUS and MSI. Used models are hovering around £2,700 on eBay. Budget £3,000 minimum and don’t expect it to improve before mid-2026.
The 575W TDP is substantial, too – make sure your PSU can handle it before you get excited and order one.
Interesting side note: Gigabyte launched the AORUS RTX 5090 AI Box – an external GPU enclosure with Thunderbolt 5 that’s specifically marketed for AI workloads. If you’ve got a laptop with Thunderbolt 5, you could run 70B models through an external box. Haven’t tested it myself, but the concept is sound.
NVIDIA DGX Spark
NVIDIA’s “personal AI supercomputer” that I covered in detail in my beginner’s guide to AI mini PCs. The Grace Blackwell GB10 chip with 128GB of unified LPDDR5X and up to 1 petaFLOP of FP4 performance. This is the premium version of the same unified memory concept as the Strix Halo mini PCs above, but with NVIDIA’s own silicon and full CUDA stack.
Originally launched at $3,999, but NVIDIA hiked the price to $4,699 (~£3,800) in February 2026 due to memory supply constraints. Available on Amazon and direct from NVIDIA. If you want 128GB of unified memory with NVIDIA’s ecosystem rather than AMD’s, this is it – but you’re paying a significant premium over the Framework Desktop for that CUDA compatibility.
RTX 4090 (24GB)
Still the fastest card with 24GB of VRAM, and by a decent margin over the 3090 on raw tok/s. Same VRAM ceiling though, and that’s the catch – twenty-four gig is twenty-four gig regardless of what you paid. Buying new? Get this one. Buying used? The 3090 at roughly half the price gives you the same model capacity – which is the metric that matters for local AI. £1,600-2,000 on Amazon.
Mac Studio M4 Max / M3 Ultra
For running the biggest models money can buy in a desktop form factor. The M4 Max with 128GB (from £3,999) runs 70B models at high quantisation with room for long context windows. The M3 Ultra at 192GB (from £5,999) remains the capacity flagship. Apple cancelled the M4 Ultra entirely – they’re skipping straight to M5 Ultra, expected around June 2026 at WWDC. If you’re considering an Ultra, you might want to wait a couple of months.
A hundred and ninety-two gigabytes of unified memory handles 100B+ parameter models that would need a quad-GPU PC build to match. Both from apple.com only. Same trade-off as the Mac Mini: slower tok/s than NVIDIA on models that fit in NVIDIA VRAM, but a capacity ceiling nothing else touches in a quiet box.
Quad RTX 3090 Build (AM4/AM5)
Digital Spaceport validated this build: four RTX 3090s on an AM4 B550 motherboard. Ninety-six gigabytes of VRAM. We’re talking 100-180 tok/s on 12-20B models, which is absurd throughput. Price per GB of VRAM works out to roughly £30/GB – the cheapest path to serious capacity if you don’t mind some noise.
The PSU needs to be a 2,000W unit minimum and you’ll want a case with serious airflow (or an open-air test bench, which is what most people building these seem to end up with). Fair warning: your partner will comment on the noise. You’ve basically built a small datacentre that happens to live under your desk. Budget: £3,000-3,500 for GPUs plus platform.
Does the CPU Matter?
Short answer: not much for inference, and I say this as someone running a Threadripper 5990x. The GPU does almost all the work during token generation. Where the CPU matters is prompt processing (the initial “thinking” phase before the model starts responding) and if you’re offloading layers to system RAM because your model doesn’t quite fit in VRAM.
For a dedicated inference machine, any modern 6-core CPU is fine. Don’t spend £500 on a CPU when that money could go toward more VRAM. The one exception is the unified memory machines (Framework Desktop, DGX Spark) where the CPU and GPU share memory bandwidth – there, the chip choice is the whole machine.
Software Stack
Buying the hardware’s actually the easy bit – it’s the software stack where people tend to get stuck.
Ollama – installed it the day I got my first 3090, and it’s still the one I’d tell anyone to start with. The whole workflow is ollama pull llama3:70b and then you’re chatting. Quantisation handled for you, works on everything. Benchmarks I’ve looked at suggest you lose maybe 10-30% on raw throughput versus running llama.cpp bare – which sounds bad until you remember Ollama had you running models in five minutes flat while you’d still be reading llama.cpp compile flags.
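For anything scripted, Ollama also exposes a local REST API on port 11434. A minimal sketch, assuming you’ve already pulled a model (swap in whatever model name you actually have):

```python
# Minimal streaming chat against a local Ollama server (default port 11434).
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Explain KV cache in one sentence."},
    stream=True,
)
for line in resp.iter_lines():
    if line:  # each line is a JSON chunk with a partial "response"
        print(json.loads(line).get("response", ""), end="", flush=True)
print()
```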
LM Studio has the best GUI experience I’ve found for local models. Built-in model browser, chat-with-your-files (that’s RAG), no terminal needed. Perfect if terminals make you nervous. I use LM Studio on my inference rig alongside the houtini-lm MCP server I built for offloading work from Claude Code to cheaper models. I also wrote a full setup guide if you want to get started.
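LM Studio’s local server speaks the OpenAI API, which means the standard `openai` Python client points straight at it. A minimal sketch – the port is LM Studio’s default, and the `api_key` is ignored locally so any placeholder string works:

```python
# Chat with whatever model LM Studio currently has loaded, via its
# OpenAI-compatible endpoint (default http://localhost:1234/v1).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
reply = client.chat.completions.create(
    model="local-model",  # LM Studio routes this to the loaded model
    messages=[{"role": "user", "content": "GGUF vs EXL2 in two sentences."}],
)
print(reply.choices[0].message.content)
```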
llama.cpp is the speed baseline that everything else gets measured against. More config, more control, faster output. Serious multi-GPU setups tend to run this directly rather than going through Ollama’s wrapper.
text-generation-webui – oobabooga’s project, and honestly the one that taught me most about how inference actually works. You pick between ExLlamaV2 (fastest GPU-only loader) or llama.cpp (flexible CPU offloading) depending on your hardware situation. Learning curve is real, took me a solid weekend to get comfortable, but once you’re past that you can tune everything and understand why your settings matter.
GGUF vs EXL2
Two model formats worth knowing about. GGUF runs everywhere – Macs, mixed CPU/GPU setups, systems where the model doesn’t quite fit in VRAM. Universal format. EXL2 is NVIDIA GPU-only but faster when the model fits entirely in VRAM.
On Apple Silicon: GGUF via MLX or llama.cpp. Got enough NVIDIA VRAM? EXL2 for best speed. Not sure which? GGUF. It always works.

Hardware Compared
| Hardware | VRAM | Price (GBP) | tok/s (14B) | tok/s (27B+) | Best For |
|---|---|---|---|---|---|
| RTX 3060 12GB | 12GB | ~200 | ~42 (8B) / ~23 (14B) | – | Budget entry, 7-14B models |
| RX 9060 XT 16GB | 16GB | ~300 | ~25 est | – | Budget AMD, 14B at Q8 |
| RTX 3090 (used) | 24GB | 650-800 | 28-36 | ~28 | Best value for serious work |
| Mac Mini M4 Pro | 24GB unified | 1,399 | ~15-20 est | – | Silent macOS, 13B models |
| Framework Desktop | 128GB unified | ~1,970 | ~15-20 est | ~10-15 est | 70B mini PC, modular |
| Corsair AI Workstation 300 | 128GB unified | ~2,000 | ~15-20 est | ~10-15 est | 70B mini PC, Corsair build |
| GMKtec EVO-X2 | 96GB alloc | ~2,000-2,500 | ~15-20 est | ~10-15 est | 70B mini PC, first to market |
| RTX 4000 Ada | 20GB | 1,150 | ~20-25 est | – | Multi-GPU builds, low power |
| Dual 3090 (NVLink) | 48GB | 1,800-2,200 | 30+ | 15-20 | 70B models, prosumer |
| RTX 4090 | 24GB | 1,600-2,000 | ~40-50 est | ~35 est | Fastest 24GB option |
| RTX 5090 | 32GB | 2,900-3,500 | ~50+ est | ~40 est | Single-GPU 70B, no sharding |
| DGX Spark | 128GB unified | ~3,800 | ~15-20 est | ~10-15 est | 128GB NVIDIA ecosystem |
| Mac Studio M3 Ultra | 192GB unified | 5,999+ | ~10-15 est | ~8-12 est | 100B+ models, nothing else can |
| Quad 3090 | 96GB | 3,000-3,500 | 100-180 | 26+ | Maximum VRAM on a budget |
Benchmarked figures from Digital Spaceport and Hardware Corner. Estimates marked ‘est’ from Gemini research and community reports. Mini PC tok/s varies significantly by model size and quantisation.
What I’d Actually Buy
Had someone asked me this question two years ago I’d have said “whatever has the most VRAM under a grand.” My answer hasn’t really changed. Under £800, a used RTX 3090 is still the obvious play: twenty-four gig of VRAM, NVLink-ready for the inevitable second card, and enough capacity to run every model up to 34B at decent quantisation. Exactly where I started, and knowing what I know now, I’d make the same call.
If you’re on a tight budget and £800 is a stretch, the RTX 3060 12GB at £200 gets you surprisingly far. Forty-two tok/s on Llama 3 8B. Thirty tok/s on Qwen 2.5 14B with the right quantisation. Pair it with a £100 used workstation and you’re running local AI for less than two months of ChatGPT Plus.
Between £800 and £3,000, things have genuinely shifted since I first wrote this article – and it’s the mini PCs that did it. If you want 70B from a box you can hide on a shelf, the Framework Desktop at ~£1,970 for 128GB of unified memory is where I’d look first – it’s modular, it’s repairable, and when the next generation of chip arrives you can upgrade without binning the whole machine. Already on macOS and working with smaller models? Mac Mini M4 Pro. Interested in going down the same rabbit hole I went down? RTX 4000 Ada cards in a Threadripper workstation – quiet, dense, and the power draw won’t terrify you.
Above £3,000, the RTX 5090 is the obvious pick if you can find one at a sane price (budget £3,000 minimum right now) – one card, one slot, 70B without any of the multi-GPU headaches. For NVIDIA’s take on unified memory, the DGX Spark at £3,800 gives you 128GB and the full CUDA stack. For maximum capacity on a budget, the quad 3090 build on AM4 gets you 96GB of VRAM at about £30 per gigabyte. Ugly, loud, and you’d struggle to find anything with that much VRAM for less money.
Buy the most VRAM you can afford. Pretty much everything else is secondary.