Using a Local LLM to Audit Your Codebase – What Qwen3 Coder Next Catches (and Misses)

March 31, 2026
Written By Richard Baxter

I work on the messy middle between data, content, and automation – pipelines, APIs, retrieval systems, and workflows built for task efficiency.

I run a local copy of Qwen3 Coder Next on a machine under my desk. It pinned down a race condition in my production code that I’d missed. It also told me, with complete confidence, that crypto.randomUUID() doesn’t work in Cloudflare Workers. It does. That tension – real bugs mixed with confident nonsense – is what makes local LLM code review tricky, but genuinely token-saving if you know how to work with it.


The Case for a Cheaper Colleague

I’ve been building ContentMarketingIdeas.co – an editorial copilot SaaS – entirely through Claude Code. Opus writes the code, runs the tests, deploys the workers. Brilliant at that job. But every file it reads costs input tokens, and those tokens add up fast when you’re scanning eight workers with thousands of lines each.

So I started wondering: could I get something local to do the initial scan? Not to replace Opus. More like a first pair of eyes – flag the obvious stuff, catch structural issues, spot things I’ve walked past because I’m too close to the code. Then Opus only needs to read Qwen’s 1,000-token summary instead of the full 3,500 to 10,000-token file. That’s roughly 85% fewer input tokens. Not bad.

The machine running the local model is called hopper – a desktop with a decent 104GB of GPU VRAM that sits dutifully next to my main workstation. LM Studio serves the model through an OpenAI-compatible HTTP API at http://hopper:1234, and Claude Code talks to it through houtini-lm – an MCP server I built specifically for this kind of delegation.
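The wiring is plain HTTP. Here's a minimal sketch of the request shape, assuming LM Studio's OpenAI-compatible chat completions endpoint; the model identifier is whatever name LM Studio assigns to the loaded model, so `qwen3-coder-next` below is an assumption:

```typescript
// Build the OpenAI-style chat completions payload that LM Studio accepts.
const LM_STUDIO_URL = "http://hopper:1234/v1/chat/completions";

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

function buildAuditRequest(fileContent: string) {
  const messages: ChatMessage[] = [
    {
      role: "system",
      content:
        "You are a senior TypeScript code reviewer. Review the following " +
        "Cloudflare Worker code for bugs, security issues, performance " +
        "problems, and code quality.",
    },
    { role: "user", content: fileContent },
  ];
  // Model name as loaded in LM Studio (assumed identifier).
  return { model: "qwen3-coder-next", temperature: 0.2, messages };
}

async function audit(fileContent: string): Promise<string> {
  const res = await fetch(LM_STUDIO_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildAuditRequest(fileContent)),
  });
  const data = (await res.json()) as {
    choices: { message: { content: string } }[];
  };
  return data.choices[0].message.content;
}
```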

[Screenshot: LM Studio running Qwen3 Coder Next on hopper – the local server interface showing the model loaded and the API endpoint active]

What I Tested

Eight production files from my project. Real Cloudflare Workers code – API routes, queue consumers, enrichment pipelines, article production. Not toy examples. 67 API routes, 8 workers, D1 database, Queues, Vectorize, roughly 30 MCP tools. The kind of codebase where bugs hide in the seams between services.

One file at a time. Simple system prompt: “You are a senior TypeScript code reviewer. Review the following Cloudflare Worker code for bugs, security issues, performance problems, and code quality.” Full file in the user message. Temperature 0.2. No special instructions, no project context, no CLAUDE.md injected. Just the code and a role.

Then I QA’d every finding with Opus. Agree, disagree, partially agree – with reasoning for each. 74 findings across 8 files, each with a graded verdict. That’s the dataset.

The Numbers

Coder Next produced 54 findings across 6 files. Of those, 14 to 17 were actionable – a 26 to 31 percent hit rate. Reasonable; most static analysis tools land somewhere in that range for signal-to-noise (bear in mind this is code already written by Opus 4.6).

And then I ran 2 files through Qwen3 Coder 30B-A3B-Instruct – a different model with aggressive 3-bit quantization. Same prompt. Same files. 20 findings, 0 to 15 percent actionable. Closer to just noise.

[Chart: Audit actionable hit rates – Coder Next at 26–31% versus Coder 30B-A3B at 0–15%]
| Metric | Coder Next | Coder 30B-A3B |
|---|---|---|
| Files reviewed | 6 | 2 |
| Total findings | 54 | 20 |
| Actionable hit rate | 26–31% | 0–15% |
| Factual error rate | 5.6% | 10% |
| Avg response time | ~55s | ~76s |
| Best finding | Race condition in Promise.all | None actionable |

The quantization killed it. Coder 30B-A3B produced generic textbook findings – “could lead to” speculation, flagging API keys “passed without validation” (nonsensical for wrangler secrets), and at one point explicitly admitting it was hallucinating: “Not directly shown… but assumed.” I laughed at that one. Model quality matters more than model size for code review. Aggressive quantization strips exactly the nuance you need – the ability to tell a deliberate design decision from an actual bug.

What It Found (The Good)

The standout: a race condition in my delivery worker. I had a counter tracking header images generated inside a Promise.all map. Multiple concurrent callbacks reading and incrementing the same variable. Classic async mutation bug – could exceed the per-delivery image cap. I’d looked at that code dozens of times and never noticed. Qwen caught it in 50 seconds.
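The shape of the bug, reduced to a hedged sketch (names are illustrative, not the actual worker code). In JavaScript the hazard isn't a data race but a check-then-act gap across await points: every callback reads the counter before any of them has incremented it:

```typescript
// BUGGY shape: each async callback checks the shared counter synchronously,
// before its first await, so all of them see 0 and all pass the cap check.
async function generateImagesBuggy(items: string[], cap: number) {
  let generated = 0;
  const results = await Promise.all(
    items.map(async (item) => {
      if (generated >= cap) return null;          // check...
      await new Promise((r) => setTimeout(r, 0)); // ...async gap...
      generated++;                                // ...then act: too late
      return `image-for-${item}`;
    }),
  );
  return results.filter((r): r is string => r !== null);
}

// SAFER shape: decide how many items get images before going concurrent.
async function generateImagesSafe(items: string[], cap: number) {
  return Promise.all(
    items.slice(0, cap).map(async (item) => `image-for-${item}`),
  );
}
```

With four items and a cap of two, the buggy version produces four images – every callback passes the check before the first increment lands.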

Also caught: duplicate database queries in my enrichment worker (querying user_sites twice for the same data when one cached Set would do), unprotected JSON.parse calls in synthesis that could crash the worker on corrupted data, and an XSS vector in my markdown-to-HTML converter. That last one was a nasty security hole – I wasn’t escaping HTML before applying regex formatting, so link URLs could inject scripts.
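The XSS fix, sketched under the assumption of a simple regex-based converter (the real converter isn't shown here): escape HTML entities first, then apply formatting, so nothing from the source markdown lands in the output as live markup:

```typescript
// Escape HTML special characters BEFORE regex formatting runs.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function markdownToHtml(md: string): string {
  const safe = escapeHtml(md); // neutralise injected tags and attribute breakouts
  return safe
    .replace(/\*\*(.+?)\*\*/g, "<strong>$1</strong>")
    .replace(/\[(.+?)\]\((.+?)\)/g, '<a href="$2">$1</a>');
}
```

Escaping doesn't validate URL schemes – a `javascript:` href would still need a separate check – but it closes the tag-injection hole.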

Five fixes applied to the actual codebase from this audit. Real bugs, real patches, real commits.

What It Got Wrong (The Confident Nonsense)

Three factual errors across 54 findings – a 5.6% factual error rate. Coder Next told me crypto.randomUUID() isn’t available in Cloudflare Workers. It is – it’s a standard Web Crypto API. It claimed btoa() doesn’t exist in Workers. It does – Workers implement Web APIs, not Node.js APIs. And it flagged my D1 parameterized query pattern as SQL injection, when map(() => '?').join(',') with .bind(...ids) is literally the correct D1 approach.
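For reference, the flagged pattern is the standard D1 way to bind a variable-length IN list. A sketch with a minimal stand-in for the D1 types (the table and column names are illustrative; the real types come from @cloudflare/workers-types):

```typescript
// Just enough of the D1 surface for this sketch.
interface D1PreparedStatement {
  bind(...values: unknown[]): D1PreparedStatement;
  all(): Promise<{ results: unknown[] }>;
}
interface D1Database {
  prepare(sql: string): D1PreparedStatement;
}

// One '?' per id; the ids themselves never touch the SQL string.
function inPlaceholders(count: number): string {
  return Array.from({ length: count }, () => "?").join(",");
}

async function getBriefs(db: D1Database, ids: string[]) {
  const stmt = db
    .prepare(`SELECT * FROM briefs WHERE id IN (${inPlaceholders(ids.length)})`)
    .bind(...ids);
  return stmt.all();
}
```

Only the placeholder string is interpolated; the values travel through `.bind()`, which is exactly what parameterization means.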

The pattern is predictable: Qwen doesn’t know what runtime it’s reviewing code for. It assumes Node.js conventions and flags anything that doesn’t match. Once you know this blind spot, QA becomes fast – you learn to eyeball the runtime-knowledge findings and skip them. But if you’d blindly trusted the “Critical” severity labels, you’d have wasted time “fixing” things that weren’t broken.

The other consistent weakness: security severity inflation. Qwen marked things as Critical that were actually low-risk in context. Dynamic SQL column names from a Zod-validated schema? Not injection. System-generated UUIDs from a prior DB query? Not user-controlled input. It can’t distinguish trust levels, so everything looks dangerous.

Prompt Templates That Got the Best Results

Dead simple. No elaborate instruction sets, no few-shot examples, no chain-of-thought scaffolding. Two messages: a system role and the full file.

For audit:

System: "You are a senior TypeScript code reviewer. Review the following Cloudflare Worker code for bugs, security issues, performance problems, and code quality."
User: [full file content]
Temperature: 0.2

For fixes:

System: "You are a code fixer for TypeScript Cloudflare Workers. Given specific code and the issue, produce ONLY the corrected code. No explanations."
User: [exact code block] + [issue description] + [existing pattern to follow]
Temperature: 0.1
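A sketch of how the fix request's user message was assembled – the field names and formatting below are my own illustration, not the exact strings sent:

```typescript
interface FixRequest {
  code: string;     // the exact code block to correct
  issue: string;    // the audit finding
  pattern?: string; // an existing pattern from the same file to imitate
}

// Concatenate the three parts; the pattern cue is optional but, as the
// fix grades showed, it is the part that matters most.
function buildFixMessage(req: FixRequest): string {
  const parts = [`Code to fix:\n${req.code}`, `Issue: ${req.issue}`];
  if (req.pattern) {
    parts.push(`Follow this existing pattern from the same file:\n${req.pattern}`);
  }
  return parts.join("\n\n");
}
```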

The fix prompt included an existing pattern from the same file – “follow this try/catch + console.warn pattern that’s already used at line 166.” That pattern-matching cue made the difference between a 95% grade (JSON parse guards) and a 50% grade (the duplicate query fix, where no pattern was provided and Qwen hallucinated the entire D1 API).

[Diagram: The delegation model – source file flows to Qwen3 Coder Next locally, findings flow to Claude Opus for QA, producing graded results]

All default LM Studio settings. No custom sampling, no repeat penalty tuning, no context window adjustments. Downloaded the model, loaded it, started sending HTTP requests. That it worked this well on defaults is part of the point – you don’t need to be a prompt engineer or a quantization nerd to get useful results from this.

Can It Fix What It Finds?

Sort of. I ran the same 5 fix prompts through both models – identical inputs, graded outputs. Fair fight.

[Chart: Fix quality by complexity – low-complexity fixes score 75–95%, medium-complexity drops to 40–70%]
| Fix | Complexity | Coder Next | Coder 30B | Winner |
|---|---|---|---|---|
| Duplicate DB query | Medium | 50% | 40% | Coder Next |
| XSS fix | Medium | 70% | 65% | Coder Next |
| Decrypt try/catch | Low | 85% | 70% | Coder Next |
| JSON parse guards | Low | 95% | 80% | Coder Next |
| Hardcoded URLs | Low | 75% | 75% | Tie |

Coder Next won every fix except the tie on hardcoded URLs. It was also 1.9x faster on average (10 seconds vs 19 seconds per fix). But the real insight is the complexity gradient: low-complexity mechanical fixes scored 75 to 95 percent. Medium-complexity fixes that required understanding the D1 API or Hono’s response patterns scored 50 to 70 percent, with production-breaking bugs in every output.

Both models hallucinated D1 API syntax. Coder Next invented for await (const row of db.query(...)). Coder 30B invented db.select(...). Neither is a real D1 method. Both used new Response(...) instead of Hono’s c.json() for error responses. The decrypt fix from Coder 30B had a scoping error that would fail to compile – appPassword declared inside a try block but used outside it.

Only one fix out of five was production-ready without manual editing: the JSON parse guards. That one scored 95% because it’s purely mechanical – copy the existing try/catch pattern, wrap the unprotected calls, add a console.warn. No API knowledge required.
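That fix's shape, sketched as a reusable guard – the helper name and fallback convention are mine; the original inlined the try/catch blocks:

```typescript
// Guarded JSON.parse: on corrupted data, warn and fall back instead of
// letting the exception crash the worker's handler.
function safeParse<T>(raw: string, fallback: T, label: string): T {
  try {
    return JSON.parse(raw) as T;
  } catch (err) {
    console.warn(`Failed to parse ${label}:`, err);
    return fallback;
  }
}
```

Typical call site (the variable name is hypothetical): `const tags = safeParse<string[]>(rawTags, [], "brief tags");`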

The rule of thumb: if the fix is “add a guard clause following this existing pattern,” Qwen can do it. If the fix requires understanding the framework’s API surface, do it yourself or hand it to Opus.

The Token Economics

Here’s what the delegation model actually saved across the full experiment:

| Metric | Value |
|---|---|
| File tokens sent to Qwen (audit) | ~60,500 |
| File tokens sent to Qwen (fixes) | ~8,000 |
| Qwen output tokens | ~10,500 |
| Opus input for QA (Qwen summaries) | ~10,500 |
| Opus input avoided (raw files) | ~60,500 |
| Input token savings | ~85% |
| Opus cost saved | ~$0.91 |
| Qwen cost | $0 |
| Wall-clock time (all files + fixes) | ~15 minutes |
[Chart: Token delegation comparison showing 85% fewer Opus tokens – 60.5k handled locally for free versus 10.5k QA tokens to Opus]
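The arithmetic behind those figures, assuming an Opus-class input price of $15 per million tokens – that rate is my assumption for the sketch, so substitute current pricing:

```typescript
// Back-of-envelope: cost of the raw file tokens Opus never had to read.
const PRICE_PER_INPUT_TOKEN = 15 / 1_000_000; // assumed $/token for Opus input
const rawFileTokensAvoided = 60_500;          // raw files handled locally
const summaryTokensSent = 10_500;             // Qwen summaries Opus still reads

const costAvoided = rawFileTokensAvoided * PRICE_PER_INPUT_TOKEN; // ≈ $0.91
const qaCost = summaryTokensSent * PRICE_PER_INPUT_TOKEN;         // ≈ $0.16
```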

$0.91 saved. Not exactly retirement money. But this was 8 files – a full codebase review of 40 files at the same rate would save roughly $4.50 per pass, and if you’re running audits weekly or before every major deploy, that compounds. Honestly though, the bigger value isn’t the dollars. It’s the fresh perspective. Qwen found things I’d walked past because it doesn’t share my mental model of the code. It doesn’t know what I intended, so it questions everything. Sometimes that’s annoying (stop telling me console.log is bad in Workers!), sometimes it’s exactly what you need.

What Would Make This Better

Context. That’s the biggest gap. Qwen doesn’t know my project’s CLAUDE.md – the document that records every deliberate decision, every pitfall, every “we chose this on purpose” trade-off. Half the false positives would disappear if the model had that context. The other half are runtime knowledge gaps.

What would actually move the needle:

Runtime API awareness. A brief system prompt addendum: “This code runs on Cloudflare Workers. Web APIs (crypto.randomUUID, btoa, fetch, URL, TextEncoder) are available. Node.js APIs (Buffer, fs, path) are not.” That alone would eliminate the most embarrassing factual errors.

Trust level hints. Something like: “Variables named userId, briefId, siteId are system-generated UUIDs from prior database queries, not user-controlled input.” This would cut the security severity inflation in half.

Decision log injection. Feeding the relevant section of CLAUDE.md into the system prompt for each file. “maxOutputTokens is 8192 by design (session 13 cost optimization)” would prevent the model from flagging it as a risk.

The trade-off is context window – each addition eats into space for the actual code. But a 500-token preamble that prevents 10 false positives per file? Obviously worth it. Haven’t tested this yet. That’s next.

Mistakes I Made Along the Way

1. Trusting severity labels. Qwen marks things “Critical” that are low-risk in context. The SQL injection flag on a parameterized query. The “missing crypto API” on a standard Web API. I learned to read every Critical finding with skepticism and verify against the actual runtime documentation before acting.

2. Sending code snippets instead of full files. Early experiments with partial code produced worse results. The model needs to see imports, type definitions, and surrounding context to make good judgments. Full files, one at a time.

3. Expecting fixes from an auditor. Detection and remediation are different skills. Qwen is genuinely good at spotting structural problems – duplicate queries, missing error handling, race conditions. But asking it to write the corrected code is a different task with a much lower success rate, especially for anything beyond simple pattern matching.

4. Comparing models by volume instead of quality. Coder 30B produced 10 findings per file, same as Coder Next. The difference was entirely in quality – 0 to 15 percent actionable versus 26 to 31 percent. More findings doesn’t mean better review.

5. Not saving the raw prompts. I almost didn’t keep the JSON files with the exact prompts and responses. Having those replayable artifacts is what made the head-to-head comparison possible. Save everything.


So, the delegation model works. Local model for detection, frontier model for QA, your own judgment for anything beyond the simplest fixes. QA is non-negotiable – a combined 6.8% factual error rate across both models (5 errors in 74 findings) means you can’t skip it. But 85% input token savings and a genuine race condition caught on the first pass? I’ll take that trade every time.
