If you’ve never run a model locally and you’re wondering whether your hardware can handle it – good news, the barrier to entry is way lower than you’d think. A £200 used GPU from eBay handles models that genuinely surprise people. Thousands of pounds? Not required. But you do need to know which single number on the spec sheet actually matters, because most of them really don’t.
About eighteen months ago I bought a used RTX 3090, mostly because I was tired of paying API costs every time I wanted to experiment with a model. I’d pull a 13B model, chat with it, try a bigger one, hit the VRAM wall, and immediately start thinking about a second card. The 3090 turned into a Threadripper 5990x workstation with six NVIDIA Ada GPUs and 104GB of VRAM – it sits in my office, runs all day, and I’ve even built an MCP for Claude Code to work with my local LLM (running LM Studio).
In today’s article, I’m going to walk through the hardware mistakes and lucky finds – from a £200 GPU shoved into my work PC to mini PCs with 128GB of unified memory running 70B models from under a desk.
Quick Navigation
Jump directly to what you’re looking for:
VRAM & Quantisation | Hardware Secrets | Budget (Under £800) | RTX 3060 Benchmarks | Mini PCs for AI | Mid-Range (£800-£3,000) | Premium (£3,000+) | Does the CPU Matter? | Software Stack | Comparison Table
The Number That Matters
Something I’ve learned building these rigs: when it comes to PC workstations, VRAM on the GPU decides everything. Not clock speed, not CUDA cores, not the number NVIDIA puts on the box. If a model fits in your GPU’s memory, it runs fast enough. If it doesn’t, the remainder spills into your main system RAM and you’re suddenly at two tokens per second – it still works, but it’s slooow!
People overcomplicate this, but the arithmetic is simple. Every parameter in your model has to sit in memory somewhere – you can’t get around that. At full precision (FP16), one parameter costs you 2 bytes, so a 70-billion-parameter model needs 140GB. No consumer GPU on the planet has that kind of VRAM yet – even three of those modified 48GB “4090D” cards you see on eBay would only just cover it, and would probably melt trying. (Yes, those cards are real: rework shops take a 4090D, swap the GPU chip onto a new board and fit denser memory – reportedly the boards come out of the same factories as NVIDIA’s own. A lot less sketchy than it sounds, and I am so, so tempted.)
Quantisation fixes this. Compress those billions of parameters down to 4-bit (Q4_K_M is the format you’ll see everywhere) and each one drops to roughly half a byte. That 70B model shrinks from 140GB to about 40GB – two used RTX 3090s, with room left over for context window overhead. I’ve been running Qwen 3 Coder Next at Q6 quantisation on my own rig for a couple of months now and can’t feel any quality difference from full precision on the tasks I throw at it. I wrote up the whole process in my LM Studio setup guide if you want to try the same thing.
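If you want to sanity-check a purchase before clicking buy, the maths is simple enough to script. Here’s a minimal sketch in Python – the bits-per-weight figures are rough community numbers (Q4_K_M averages a bit over 4 bits once you count block scales), not gospel:

```python
# Back-of-envelope VRAM for model weights: parameters x bits-per-weight / 8.
# Ignores KV cache and runtime overhead, which come on top.

def model_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    print(f"70B @ {label:7}: ~{model_vram_gb(70, bits):.0f} GB")

# 70B @ FP16   : ~140 GB
# 70B @ Q8_0   : ~74 GB
# 70B @ Q4_K_M : ~39 GB
```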
One thing that’s changed since I first wrote this article: Llama 4 Scout landed with 109B total parameters in a Mixture-of-Experts architecture. Only 17B parameters are active at any time, but MoE models need all parameters loaded in memory. That means ~55-70GB of VRAM just to load it at INT4. A single RTX 4090 can’t touch it. This is pushing people toward either high-RAM unified memory machines (the mini PCs below) or multi-GPU rigs. The VRAM arms race isn’t slowing down.
Quick VRAM Guide
So what sort of model can your VRAM actually run? One warning before the table: the RTX 3080 shipped in both 10GB and 12GB versions, so check which one you’re getting if you’re buying second-hand on eBay.
| Model Size | At Q4 (4-bit) | At Q8 (8-bit) | Good GPU Fit |
|---|---|---|---|
| 7B | 6-8 GB | 10-12 GB | RTX 3060 12GB, RX 9060 XT 16GB |
| 13-14B | 10-12 GB | 16-18 GB | RTX 3060 12GB, RX 9060 XT 16GB, 4060 Ti 16GB |
| 34B | 20-24 GB | 30+ GB | RTX 3090 or 4090 (24GB) |
| 70B | ~40 GB | ~75 GB | Dual 3090s, Mac Studio, RTX 5090, or 128GB mini PC |
| 109B MoE (Llama 4 Scout) | ~55-70 GB | ~110+ GB | 128GB unified memory (Framework/GMKtec/DGX Spark) |
| 100B+ dense | 60-70 GB | 100+ GB | Quad 3090s, M3 Ultra 192GB |
Don’t forget KV cache on top of this – it stores your conversation state and grows with context length. At 32k tokens, budget for another 2-4GB, which caught me out the first time I tried to squeeze a 34B onto a 24GB card.
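You can estimate that KV overhead yourself. The architecture numbers below are assumptions for a Llama-3-70B-style model (80 layers, 8 KV heads via grouped-query attention, head dimension 128) – check your model’s card, because these vary a lot between families:

```python
# KV cache: 2 tensors (K and V) per layer, n_kv_heads * head_dim values
# stored per token per layer.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

print(f"FP16 cache @ 32k: ~{kv_cache_gb(80, 8, 128, 32768):.1f} GB")
print(f"Q8 cache   @ 32k: ~{kv_cache_gb(80, 8, 128, 32768, 1):.1f} GB")
# ~10.7 GB and ~5.4 GB for a 70B-class model at 32k context; smaller models
# and quantised caches are how you land in the 2-4 GB range mentioned above.
```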

Local LLM Hardware Secrets
Here are three bits of hardware wisdom I’ve had to learn the expensive way.
One Fast Card Beats Two Slow Ones
Tempting maths: two RTX 3060 12GB cards give you 24GB total – the same VRAM as a single 3090. Same capacity, completely different speed. I made a version of this mistake myself: I bought an array of bargain Ada-generation RTX 4000s and 4500s, and both the mix of cards and the sheer number of them hurt. It runs, but I reckon I’m losing at least 20% of the performance purely because of all the PCIe lanes in play.
Digital Spaceport did the numbers. A single 3090 hits 28 tokens per second on Gemma 3 27B at Q4. The dual 3060 setup? Six. On the exact same model. Splitting a model across GPUs over PCIe – sharding, they call it – kills throughput because the cards spend more time talking to each other than doing inference. I really, really wish I’d understood this before I sold the 3090s from my GPU mining days.
So when does multi-GPU work? When the model needs both cards anyway. Two 3090s running a 70B model that requires 48GB of VRAM is fine – fifteen to twenty tok/s with NVLink. But don’t buy two cheap cards hoping they’ll match one expensive one. They won’t. Don’t mix generations, don’t mix VRAM sizes – and most consumer “gaming” motherboards won’t run a full x16 PCIe link on more than one slot. Simple is actually the best approach.
NVLink for Multi-GPU Fine Tuning
Something I genuinely didn’t expect when I added my second GPU: the connection between the cards ends up mattering almost as much as the cards themselves in specific use cases. What NVLink does is give the GPUs their own private highway – 112.5 GB/s bidirectional. Compare that with regular PCIe 4.0 x8, which tops out around 16 GB/s. About seven times slower, and you absolutely notice it in practice.
The caveat: NVLink mostly pays off in fine-tuning, not in inference (chat!) – oh well.
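Some rough arithmetic shows why. With a model split across two cards, inference only pushes the hidden state for each new token over the link, while fine-tuning has to move gradients every optimiser step. The hidden size and gradient volume below are illustrative assumptions, not measurements:

```python
# Rough traffic comparison for a model split across two GPUs. Inference moves
# one hidden-state vector per generated token; fine-tuning synchronises
# gradients every step. All figures are illustrative assumptions.

hidden_size = 8192                  # assumed 70B-class hidden dimension
per_token = hidden_size * 2         # FP16 bytes per token: ~16 KB

pcie, nvlink = 16e9, 112.5e9        # bytes/sec, from the figures above

print(f"Inference: {per_token/1e3:.0f} KB/token "
      f"-> {per_token/pcie*1e6:.1f} us over PCIe x8")

grad_bytes = 0.5e9                  # assumed ~0.5 GB of gradients per step
print(f"Fine-tune: {grad_bytes/pcie*1e3:.0f} ms/step over PCIe "
      f"vs {grad_bytes/nvlink*1e3:.0f} ms over NVLink")
# Inference traffic is tiny either way; fine-tuning is where the fat pipe pays.
```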
Apple Silicon: Capacity Over Speed
A Mac Studio M3 Ultra with 192GB of unified memory can load models that would need four discrete NVIDIA GPUs on a PC. All that RAM is GPU-accessible. No PCIe bottleneck, no sharding penalty. Near-silent, too, which matters if (like me) you’re working in the same room as the hardware.
Speed-wise, NVIDIA is quicker – about 2-3x on models that fit in its VRAM. A dual 3090 PC does 15-20 tok/s on 70B; the M3 Ultra manages 8-12 tok/s on the same model. Where the Mac pulls ahead is models above 100B parameters that the PC can’t touch without a quad-GPU build, and frankly, for research tasks where you’re running huge models rather than chatting interactively, it makes more sense than people give it credit for.
Unified Memory Changes Everything
So here’s what actually changed in 2026. Apple proved the concept with Apple Silicon years ago, but now AMD’s Strix Halo chips bring the same unified memory architecture to Windows PCs. The Framework Desktop, Corsair AI Workstation 300, and GMKtec EVO-X2 all pack 128GB of shared memory that the integrated GPU can access directly. No PCIe bus, no sharding. You load a 70B model into memory and the GPU just… uses it.
The trade-off is speed. These integrated GPUs are slower than a dedicated RTX card on models that fit in discrete VRAM. But for models that don’t fit – 70B, Llama 4 Scout, anything MoE – unified memory machines are the only option under £3,000 that doesn’t involve multiple GPUs and a wiring diagram. I got into all of this in my beginner’s guide to AI mini PCs and the DGX Spark – worth reading if the unified memory thing is new to you.
Budget: Under £800
RTX 3060 12GB
The cheapest way into serious local AI. Twelve gigabytes of VRAM at 170W TDP, running 7B models at Q8 or 13B models at Q4 – Llama 3, Mistral, Phi-3, all the capable smaller models that have come out this past year.
Why this over a newer RTX 4060? Because the 4060 only ships with 8GB of VRAM, and for AI work, 12GB from an older generation beats 8GB from a newer one every single time – pretty much consensus in the local LLM community at this point. Pick one up for about £200 on Amazon, pair it with a used HP Z440 workstation off eBay (about £100) and you’ve got a complete AI rig for under £350. System idles at around 65 watts.
RTX 3060 12GB: Actual Benchmarks
People keep searching for specific numbers on this card, so here they are. These are community benchmarks from Hardware Corner and Digital Spaceport, confirmed against my own testing where I could:
| Model | Quantisation | Tokens/sec | Context |
|---|---|---|---|
| Llama 3 8B | Q4_K_M | ~42-50 | 4k-16k |
| Mistral 7B | Q4_K_M | ~40-50 | 4k-16k |
| Qwen 2.5 14B | Q4_K_M | ~22-23 | 16k |
| Qwen 2.5 14B | 5-bit EXL2 | ~30-33 | 8k |
| Phi-3 14B | Q4_K_M | ~22-25 | 16k |
| Any 20B+ model | Q4 | ~9 | Limited |
The 14B sweet spot is genuinely impressive for a £200 card. Thirty tokens per second on Qwen 2.5 14B at 5-bit EXL2 – honestly, at that speed I forget I’m running it locally rather than hitting an API. The hard ceiling hits around 20B parameters, and after that you’re spilling into system RAM and it drops to single digits. Still usable if you’re batching things overnight, but not for interactive chat.
If you’re running ExLlamaV2 (which you should be for GPU-only inference on NVIDIA), the 3060’s 360 GB/s of memory bandwidth actually beats the RTX 4060’s 272 GB/s, and token generation speeds follow suit. Newer architecture doesn’t matter when you simply don’t have enough VRAM.
RX 9060 XT 16GB (AMD’s New Budget Option)
Worth knowing about if you’re buying new in 2026. Sixteen gigabytes of GDDR6 on AMD’s RDNA 4 architecture for about £300. Four extra gig over the RTX 3060 means 14B models at Q8 fit comfortably, and you can squeeze 24B models in at aggressive quantisation.
The catch is software. AMD’s ROCm stack for AI workloads has improved massively over the past year, but it’s still behind CUDA in terms of compatibility and community support. Most local LLM tools work – Ollama, llama.cpp, LM Studio all support AMD now – but you’ll hit more edge cases than you would on NVIDIA. If you’re comfortable troubleshooting, the extra VRAM is worth it. If you want everything to just work first time, the RTX 3060 is still the safer buy.
RTX 3090 24GB (Used/Renewed)
Yeah, it’s two generations old. The local AI community collectively shrugged at that ages ago and kept buying them.
Twenty-four gigabytes of GDDR6X handles 34B models at Q4 or 70B at tight quantisation. I ran one of these for about a year before the Threadripper build happened, and looking back I’m slightly embarrassed at how long I underestimated what a single 24GB card could handle. Community benchmarks from Digital Spaceport show 28-36 tok/s on 14B models, 28 tok/s on Gemma 3 27B Q4. Nothing under a grand comes close to that combination of capacity and speed – and it’s got NVLink support for when you inevitably want to add a second one.
Renewed cards run £650-800 on Amazon. Most sellers give you about 90 days of warranty. Bit of a gamble, but I’ve not heard of widespread failure rates from the AI community. If you’re planning multi-GPU later, look for blower-style cards – they exhaust heat out the back instead of dumping it onto the card above. The 350W TDP per card adds up fast when you’ve got two of them in the same case.
Mini PCs for AI
This is the section that didn’t exist when I first wrote this article, and it’s probably the most important update. The whole landscape shifted when AMD shipped Strix Halo – a laptop-class chip with 128GB of unified memory that the integrated GPU can access directly. Suddenly you can run 70B models from a box that fits on a shelf and draws 120W. No discrete GPU needed.
I covered the technology in depth in my beginner’s guide to AI mini PCs and the DGX Spark, but here’s the practical buying guide.
Framework Desktop (128GB)
Honestly? If I were starting from scratch today this is probably where my money would go. AMD Ryzen AI Max+ 395, 128GB LPDDR5X unified memory, crammed into a 4.5-litre case that Framework co-designed with Cooler Master and Noctua. And because it’s Framework, the whole thing is modular – you can swap the front panel tiles, the fans, even 3D print custom bits.
96GB of that 128GB is allocatable to the GPU. Runs 70B models. Llama 4 Scout fits (just). Near-silent under inference load.
The catch: LPDDR5X prices have gone through the roof. Framework originally priced the 128GB model at $1,999 but it’s now up to around $2,459 (~£1,970) due to memory supply constraints. Still the cheapest 128GB unified memory machine you can buy, and the modular design means you’re not throwing the whole thing away when the next generation of chip arrives.
Corsair AI Workstation 300
Corsair’s answer to the same question. Same Strix Halo chip, same 128GB of LPDDR5X, but in Corsair’s own compact 4.4-litre chassis with a 300W Flex ATX PSU. Shipping now at around $2,500 (~£2,000).
Less modular than Framework, but Corsair’s build quality and cooling are proven. If you already trust Corsair hardware (and half the PC gaming community does), this is the path of least resistance into unified memory AI.
GMKtec EVO-X2
The surprise entry that started this whole mini PC category. Same AMD Ryzen AI Max+ 395, same 128GB option, slightly different cooling approach. Around £2,000-2,500 on Amazon.
It was the first to market and the early reviews are solid. Speed won’t match a 3090, mind you – somewhere around 10-15 tok/s on 27B models from what I’ve seen. For an always-on inference box that handles 70B from under your desk without waking the house though, I haven’t found anything else in this bracket. Plus I expect to see gen 1 Strix Halo mini PCs on eBay for £500-600 in two years’ time once the next chip generation lands.
Mid-Range: £800 – £3,000
Mac Mini M4 Pro (24GB)
Apple’s cheapest route into unified memory for AI work. Twenty-four gig of unified memory, which handles 13-14B models nicely through MLX. Slower on raw tok/s than a 3090, but the software side is painless – Ollama runs natively, no CUDA drivers to wrestle with. £1,399 on Amazon.
Not going to touch 70B, not remotely. But for 7-14B work – coding assistants, summarisation, local chatbots – it’s a genuinely lovely, quiet machine that does exactly what you’d want. If you’re on macOS already and want to dip a toe into local inference, this is probably where I’d point you first.
RTX 4000 Ada (Workstation, 20GB)
I run these in my own rig and they’ve been brilliant. Single-slot form factor at 130W per card, twenty gig of VRAM each. Stick four of them in a standard workstation case and you’re sitting on 80GB total at 520W combined – more than enough for 70B models at Q5 with headroom left over for context windows.
I’ve got six in my Threadripper 5990x (mixed with RTX 4500 Adas) for 104GB total. Quiet enough to sit in my office all day, which was the main engineering constraint because I’m working next to it eight hours a day. The whole system pulls about 800W under full inference load – sounds like a lot until you compare it with a quad 3090 setup drawing 1,400W. Raw tok/s per card is lower than gaming GPUs, but the density and power efficiency are what sold me for a machine that runs continuously. About £1,150 each on Amazon.
Dual RTX 3090 Build
The prosumer sweet spot for people who want 70B models on NVIDIA hardware. Two 3090s together give you 48GB of total VRAM. Bridge them with NVLink and you’re looking at 15-20 tok/s on 70B Q4. Skip the bridge and it drops to 10-14 tok/s, which sounds bad until you actually try it – still plenty fast enough to hold a conversation with a model.
Build essentials: the pair of cards will set you back £1,300-1,500 used. The PSU situation gets interesting because each card wants 350W under load, so budget for a 1,200-1,600W unit. For the platform, Threadripper or HEDT gives you full x16/x16 PCIe bandwidth – consumer boards like Z790 or X670E split to x8/x8, which works but costs some throughput. An NVLink bridge runs about £40-60 used. Whole thing comes in at £1,800-2,200 depending on your platform choice, and no, nobody sells this as a pre-built – you’re getting your hands dirty.
Premium: £3,000+
RTX 5090 (32GB)
The biggest single card you can walk into a shop and buy. Thirty-two gigabytes of GDDR7, 512-bit bus, Blackwell architecture – and for the first time, a quantised 70B model actually fits on one card. No sharding, no NVLink, no dual-GPU headaches. One slot, done.
Bad news on pricing, though. The £1,799 MSRP is a fantasy at this point – GDDR7 supply constraints and AI demand have pushed real UK street prices to £2,899 for the cheapest models (Zotac Solid from Overclockers UK) and up to £3,500-4,400 for premium cards from ASUS and MSI. Used models are hovering around £2,700 on eBay. Budget £3,000 minimum and don’t expect it to improve before mid-2026.
The 575W TDP is substantial, too – make sure your PSU can handle it before you get excited and order one.
Interesting side note: Gigabyte launched the AORUS RTX 5090 AI Box – an external GPU enclosure with Thunderbolt 5 that’s specifically marketed for AI workloads. If you’ve got a laptop with Thunderbolt 5, you could run 70B models through an external box. Haven’t tested it myself, but the concept is sound.
NVIDIA DGX Spark
NVIDIA’s “personal AI supercomputer” that I covered in detail in my beginner’s guide to AI mini PCs. The Grace Blackwell GB10 chip with 128GB of unified LPDDR5X and up to 1 petaFLOP of FP4 performance. This is the premium version of the same unified memory concept as the Strix Halo mini PCs above, but with NVIDIA’s own silicon and full CUDA stack.
Originally launched at $3,999, but NVIDIA hiked the price to $4,699 (~£3,800) in February 2026 due to memory supply constraints. Available on Amazon and direct from NVIDIA. If you want 128GB of unified memory with NVIDIA’s ecosystem rather than AMD’s, this is it – but you’re paying a significant premium over the Framework Desktop for that CUDA compatibility.
RTX 4090 (24GB)
Still the fastest card with 24GB of VRAM, and by a decent margin over the 3090 on raw tok/s. Same VRAM ceiling though, and that’s the catch – twenty-four gig is twenty-four gig regardless of what you paid. Buying new? Get this one. Buying used? The 3090 at roughly half the price gives you the same model capacity – which is the metric that matters for local AI. £1,600-2,000 on Amazon.
Mac Studio M4 Max / M3 Ultra
For running the biggest models money can buy in a desktop form factor. The M4 Max with 128GB (from £3,999) runs 70B models at high quantisation with room for long context windows. The M3 Ultra at 192GB (from £5,999) remains the capacity flagship. Apple cancelled the M4 Ultra entirely – they’re skipping straight to M5 Ultra, expected around June 2026 at WWDC. If you’re considering an Ultra, you might want to wait a couple of months.
A hundred and ninety-two gigabytes of unified memory handles 100B+ parameter models that would need a quad-GPU PC build to match. Both from apple.com only. Same trade-off as the Mac Mini: slower tok/s than NVIDIA on models that fit in NVIDIA VRAM, but a capacity ceiling nothing else touches in a quiet box.
Quad RTX 3090 Build (AM4/AM5)
Digital Spaceport validated this build: four RTX 3090s on an AM4 B550 motherboard. Ninety-six gigabytes of VRAM. We’re talking 100-180 tok/s on 12-20B models, which is absurd throughput. Price per GB of VRAM works out to roughly £30/GB – the cheapest path to serious capacity if you don’t mind some noise.
The PSU needs to be a 2,000W unit minimum and you’ll want a case with serious airflow (or an open-air test bench, which is what most people building these seem to end up with). Fair warning: your partner will comment on the noise. You’ve basically built a small datacentre that happens to live under your desk. Budget: £3,000-3,500 for GPUs plus platform.
Does the CPU Matter?
Short answer: not much for inference, and I say this as someone running a Threadripper 5990x. The GPU does almost all the work during token generation. Where the CPU matters is prompt processing (the initial “thinking” phase before the model starts responding) and if you’re offloading layers to system RAM because your model doesn’t quite fit in VRAM.
For a dedicated inference machine, any modern 6-core CPU is fine. Don’t spend £500 on a CPU when that money could go toward more VRAM. The one exception is the unified memory machines (Framework Desktop, DGX Spark) where the CPU and GPU share memory bandwidth – there, the chip choice is the whole machine.
Software Stack
Buying the hardware’s actually the easy bit – it’s the software stack where people tend to get stuck.
Ollama – installed it the day I got my first 3090, and it’s still the one I’d tell anyone to start with. The whole workflow is ollama pull llama3:70b and then you’re chatting. Quantisation handled for you, works on everything. Benchmarks I’ve looked at suggest you lose maybe 10-30% on raw throughput versus running llama.cpp bare – which sounds bad until you remember Ollama had you running models in five minutes flat while you’d still be reading llama.cpp compile flags.
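For anything scripted, Ollama also exposes a local REST API on port 11434. A minimal sketch, assuming you’ve already pulled a model (swap in whatever model name you actually have):

```python
# Minimal streaming chat against a local Ollama server (default port 11434).
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Explain KV cache in one sentence."},
    stream=True,
)
for line in resp.iter_lines():
    if line:  # each line is a JSON chunk with a partial "response"
        print(json.loads(line).get("response", ""), end="", flush=True)
print()
```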
LM Studio has the best GUI experience I’ve found for local models. Built-in model browser, chat-with-your-files (that’s RAG), no terminal needed. Perfect if terminals make you nervous. I use LM Studio on my inference rig alongside the houtini-lm MCP server I built for offloading work from Claude Code to cheaper models. I also wrote a full setup guide if you want to get started.
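LM Studio’s local server speaks the OpenAI API, which means the standard `openai` Python client points straight at it. A minimal sketch – the port is LM Studio’s default, and the `api_key` is ignored locally so any placeholder string works:

```python
# Chat with whatever model LM Studio currently has loaded, via its
# OpenAI-compatible endpoint (default http://localhost:1234/v1).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
reply = client.chat.completions.create(
    model="local-model",  # LM Studio routes this to the loaded model
    messages=[{"role": "user", "content": "GGUF vs EXL2 in two sentences."}],
)
print(reply.choices[0].message.content)
```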
llama.cpp is the speed baseline that everything else gets measured against. More config, more control, faster output. Serious multi-GPU setups tend to run this directly rather than going through Ollama’s wrapper.
text-generation-webui – oobabooga’s project, and honestly the one that taught me most about how inference actually works. You pick between ExLlamaV2 (fastest GPU-only loader) or llama.cpp (flexible CPU offloading) depending on your hardware situation. Learning curve is real, took me a solid weekend to get comfortable, but once you’re past that you can tune everything and understand why your settings matter.
GGUF vs EXL2
Two model formats worth knowing about. GGUF runs everywhere – Macs, mixed CPU/GPU setups, systems where the model doesn’t quite fit in VRAM. Universal format. EXL2 is NVIDIA GPU-only but faster when the model fits entirely in VRAM.
On Apple Silicon: GGUF via MLX or llama.cpp. Got enough NVIDIA VRAM? EXL2 for best speed. Not sure which? GGUF. It always works.

Hardware Compared
| Hardware | VRAM | Price (GBP) | tok/s (14B) | tok/s (27B+) | Best For |
|---|---|---|---|---|---|
| RTX 3060 12GB | 12GB | ~200 | ~42 (8B) / ~23 (14B) | – | Budget entry, 7-14B models |
| RX 9060 XT 16GB | 16GB | ~300 | ~25 est | – | Budget AMD, 14B at Q8 |
| RTX 3090 (used) | 24GB | 650-800 | 28-36 | ~28 | Best value for serious work |
| Mac Mini M4 Pro | 24GB unified | 1,399 | ~15-20 est | – | Silent macOS, 13B models |
| Framework Desktop | 128GB unified | ~1,970 | ~15-20 est | ~10-15 est | 70B mini PC, modular |
| Corsair AI Workstation 300 | 128GB unified | ~2,000 | ~15-20 est | ~10-15 est | 70B mini PC, Corsair build |
| GMKtec EVO-X2 | 96GB alloc | ~2,000-2,500 | ~15-20 est | ~10-15 est | 70B mini PC, first to market |
| RTX 4000 Ada | 20GB | 1,150 | ~20-25 est | – | Multi-GPU builds, low power |
| Dual 3090 (NVLink) | 48GB | 1,800-2,200 | 30+ | 15-20 | 70B models, prosumer |
| RTX 4090 | 24GB | 1,600-2,000 | ~40-50 est | ~35 est | Fastest 24GB option |
| RTX 5090 | 32GB | 2,900-3,500 | ~50+ est | ~40 est | Single-GPU 70B, no sharding |
| DGX Spark | 128GB unified | ~3,800 | ~15-20 est | ~10-15 est | 128GB NVIDIA ecosystem |
| Mac Studio M3 Ultra | 192GB unified | 5,999+ | ~10-15 est | ~8-12 est | 100B+ models, nothing else can |
| Quad 3090 | 96GB | 3,000-3,500 | 100-180 | 26+ | Maximum VRAM on a budget |
Benchmarked figures from Digital Spaceport and Hardware Corner. Estimates marked ‘est’ from Gemini research and community reports. Mini PC tok/s varies significantly by model size and quantisation.
What I’d Actually Buy
Had someone asked me this question two years ago I’d have said “whatever has the most VRAM under a grand.” My answer hasn’t really changed. Under £800, a used RTX 3090 is still the obvious play: twenty-four gig of VRAM, NVLink-ready for the inevitable second card, and enough capacity to run every model up to 34B at decent quantisation. Exactly where I started, and knowing what I know now, I’d make the same call.
If you’re on a tight budget and £800 is a stretch, the RTX 3060 12GB at £200 gets you surprisingly far. Forty-two tok/s on Llama 3 8B. Thirty tok/s on Qwen 2.5 14B with the right quantisation. Pair it with a £100 used workstation and you’re running local AI for less than two months of ChatGPT Plus.
Between £800 and £3,000, things have genuinely shifted since I first wrote this article – and it’s the mini PCs that did it. If you want 70B from a box you can hide on a shelf, the Framework Desktop at ~£1,970 for 128GB of unified memory is where I’d look first – it’s modular, it’s repairable, and when the next generation of chip arrives you can upgrade without binning the whole machine. Already on macOS and working with smaller models? Mac Mini M4 Pro. Interested in going down the same rabbit hole I went down? RTX 4000 Ada cards in a Threadripper workstation – quiet, dense, and the power draw won’t terrify you.
Above £3,000, the RTX 5090 is the obvious pick if you can find one at a sane price (budget £3,000 minimum right now) – one card, one slot, 70B without any of the multi-GPU headaches. For NVIDIA’s take on unified memory, the DGX Spark at £3,800 gives you 128GB and the full CUDA stack. For maximum capacity on a budget, the quad 3090 build on AM4 gets you 96GB of VRAM at about £30 per gigabyte. Ugly, loud, and you’d struggle to find anything with that much VRAM for less money.
Buy the most VRAM you can afford. Pretty much everything else is secondary.