Maybe It’s Time To Start Looking at Local Coding Models
Speed, cost, and predictability are starting to matter more to me than top-end reasoning.
This weekend, I experimented with AI image generation for this blog. ChatGPT suggested Draw Things, which can be installed via Homebrew:
brew install drawthingsai/draw-things/draw-things-cli
ChatGPT also suggested Qwen Image as the model. The generated images were excellent, arguably on par with ChatGPT 5.4. I was pleasantly surprised. Simon Willison just published “Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7 today,” in which he compares the two models and finds that Qwen produces the better image on his laptop. I mention all this because I was expecting far worse from local models.
This has me thinking I might need to try local coding models again. A few things have been on my mind lately:
- GPT 5.4 xhigh is good enough to do a lot.
- Labs fiddling with model performance behind the scenes is frustrating.
- Things are going to get expensive.
GPT 5.4 xhigh is my daily driver; it’s fantastic. Even looking at the Opus 4.7 benchmarks, it’s not totally clear to me that Opus 4.7 is better than GPT 5.4 xhigh. If Anthropic hadn’t included Mythos in their benchmark table, many of GPT 5.4’s numbers would be the bold ones (the winners), including:
- agentic terminal coding (Terminal-Bench 2.0)
- multidisciplinary reasoning with tools (HLE)
- agentic search
- graduate-level reasoning
But honestly, these benchmarks don’t matter that much. The labs launch models that perform at these levels on day one. Then, over time, they degrade them as they shift resources elsewhere and try to cut costs. The GPT 5.4 I’m using now feels worse than what I was using three weeks ago. OpenAI is about to launch Spud, so they’ve presumably moved resources there. An AMD director noticed similar behavior from Claude recently, unsurprisingly just before the Opus 4.7 release.
Things are going to get more expensive, too. We’ve been living through the $3-Uber-ride era of AI. But Anthropic is going public and OpenAI will follow suit. The bottom line will matter. Anthropic’s OpenClaw pricing changes and enterprise API price shift already reflect this. There’s no free lunch. (The OpenClaw drama might have more to do with OpenAI hiring Peter Steinberger, the OpenClaw creator, but still.)
All of this has me wondering if I should start experimenting with local coding models again. If Qwen’s coding agents perform anything like their image models, I would be pretty excited. I would probably still use GPT 5.4 for planning and review, but I could see myself using local models for the coding itself.
I guess what I’m saying is that speed, cost, and predictability are starting to matter more to me than ever-increasing top-end reasoning. As Victor Taelin posted today, “I couldn’t care less whether a model can solve erdos problem, I just need a bot that executes massive refactors for me competently, and as fast as possible.”