Going Local — Running AI on a Normal Mac mini

Chapter 01

The Email (Then the Other Email)

Anthropic sent the first warning the day before. Then, as if to make sure the message landed, sent a follow-up at 8:36 PM the same night we were running these tests:

"As of April 4 at 12pm PT, we're enforcing the policy we shared previously: third-party harnesses, including OpenClaw, no longer draw from your Claude subscription usage limits."

They named OpenClaw. Specifically. In both emails.

Two emails in 24 hours. We get it, Anthropic.

A few hours after the second one landed, we were deep in a local model shootout on a Mac mini, trying to find something that could actually replace Claude as the brain of an always-on AI assistant. The timing wasn't planned. It just happened to be the day the boa constrictor got a little tighter.

This is that story.

Chapter 02

Why We Were Here

After getting Anthropic's emails about the third-party harness policy change, I decided to test a local Gemma 4B model as a potential Claude replacement. Switched my OpenClaw session over to it, sent a message, and got this back:

I's a working partner. I's a working partner. I's a working partner.

The Gemma 4B model, bless its tiny quantized heart, absolutely could not handle the 10,000+ token system prompt that OpenClaw injects at startup. It just... looped. Confused. Repeating fragments of its own identity like a robot having an existential crisis.

Funny for about five seconds. Then clarifying.

The question became urgent: is there a local model that can actually keep up with Claude for daily assistant work?

The Backstory

Most people start with a Claude subscription. It works great. You point OpenClaw at it, configure a persona, set up memory files and project context, add some cron jobs. After a while it stops feeling like software and starts feeling like a utility — like electricity. You don't think about it.

Then Anthropic changes the rules. Third-party harnesses can't pull from subscription limits anymore. You need API access, and API access has a meter.

API costs squeezing like a boa constrictor

The "extra usage" model. Beautiful metaphor.

The immediate response: switch from Opus to Sonnet to slow the burn. Sonnet is cheaper per token, still capable. But you feel it. It's the difference between an assistant that gets it and one that gets it most of the time.

That's when local stops being an interesting experiment and becomes actual infrastructure.

Chapter 03

The Hardware

Let's be specific about the machine, because this matters.

M4

Mac mini chip
(regular, not Pro)

24GB

Unified memory
(one config bump up)

10

CPU cores
(4P + 6E)

Not a Pro. Not an Ultra. Not a cluster of Mac Studio Maxes. The regular Mac mini you bought because it was a good deal. The one sitting on desks all over the world right now.

💡

Why this matters Most "run local LLMs!" content is written by people with 64-128GB machines or multi-GPU rigs. The 24GB M4 Mac mini is the realistic baseline for most people trying this. This is that story.

With Apple Silicon, the entire 24GB is unified memory — both the CPU and the GPU share it. That's actually great for inference. The whole model sits in fast memory, no PCIe bottleneck. But 24GB is 24GB, and large models plus long context windows will find every megabyte of it.

Chapter 04

The 27B Attempt (And the Wall)

The first candidate: Qwen3.5-27B at 4-bit quantization. About 15GB of model weights, leaving 9GB for the OS and inference overhead. Theoretically fits.

Downloaded it. Configured the server. Loaded fine. Short tests worked. Then ran a real OpenClaw subagent with the full system context — 11,794 tokens of personas, memory, project context, tool routing rules, platform behavior, safety constraints.

The server log:

Prompt processing progress: 2048/11794
Prompt processing progress: 4096/11794
Prompt processing progress: 6144/11794
kIOGPUCommandBufferCallbackErrorOutOfMemory

At token 6,144 of 11,794. So close.

The model crashed at exactly the point where the KV cache for 12K tokens, added to 15GB of model weights, exceeded 24GB of unified memory.

⚠️

The hidden constraint "The model fits in RAM" and "the model can handle your actual prompt" are two different things. A 15GB model needs the remaining 9GB split between the OS, services, and the KV cache for every token in your context window. For a 12K token prompt at 27B scale, that cache alone needs ~6GB+.

On 24GB, your practical ceiling for full-context OpenClaw sessions is around 9-12B parameters. The 27B model is great. It just doesn't fit when you're also trying to process a real conversation.

Chapter 05

The Shootout

Pivot to Ollama. Drop-in model manager, OpenAI-compatible API, handles downloads and serving automatically. Three contestants — all in the 8-12B range where they can actually breathe in 24GB:

Qwen vs Llama vs Gemma. Three models enter, one becomes the daily driver.

The 70B Incident

Before we get to the results, a brief sidebar about hubris.

ollama pull llama3.3

Innocent enough command. What Ollama did: start downloading the 70B model. 42 gigabytes.

The 70B model crashing through a doorway

Caught it at 14GB. Disk hit 0% free. Had to nuke another model to recover.

⚠️

Always specify the size llama3.3 defaults to 70B. Use llama3.1:8b not llama3.1. Ollama will happily start a 42GB download with zero warnings.

The Context Window Trap

After getting the right models, first test: timed out. Zero tokens generated. The server log explained it:

WARN truncating input prompt limit=4096 prompt=10827 keep=4 new=4096
500 Internal Server Error

Ollama defaults to a 4,096 token context window. OpenClaw sends ~12K tokens. The model received a random 4K chunk of a 12K prompt — missing most of the system context — and errored out silently.

The fix is one line:

FROM qwen3:8b
PARAMETER num_ctx 16384

✅

Always set num_ctx for assistant workloads Create a custom modelfile with PARAMETER num_ctx 16384 for any model you're running with full assistant context. The default 4096 will silently truncate your prompt and produce garbage output with no obvious error message.

The Test Suite

Five tests, identical prompts, minimal system context (just enough for the persona). Direct Ollama API calls, no OpenClaw overhead. Real scoring criteria.

Test	What we're looking for
Identity check	Does it stay in persona? No sycophantic openers?
Reasoning / pushback	Will it tell you a bad idea is a bad idea?
Code generation	Correct, clean, actually runs
Context awareness	Recalls a fact from earlier in the conversation
Tone / persona	Sounds human, not like a customer support bot

The Results

Model	Avg Speed	Code	Reasoning	Persona	Verdict
Qwen3 8B Alibaba · 5.2GB	12–44s high variance	⚠ Verbose	✓ Good	✓ Best	Persona work
Llama 3.1 8B Meta · 4.9GB	7.3s avg	✓ Clean	✓ Good	⚠ Stiff	Daily driver
Gemma 3 12B Google · 8.1GB	10.0s avg	✓ Clean	✓ Best	⚠ Over-structured	Deep reasoning

The Token Limit Lesson

First run capped output at 300 tokens to keep things fast. Qwen3's code answer got cut off mid-function. Unfair — try again with 2048.

Second run: Qwen3 took 44 seconds for a code question and generated 726 tokens for what should've been 5 lines. That's not a token limit problem. That's model personality.

Qwen3 thinks out loud. It reasons through everything. With room to breathe, it fills the space. Llama and Gemma both got better with more headroom — cleaner explanations, more complete examples. Different architectures, different defaults.

Token limits reveal character. A model that improves with more headroom is different from one that fills the space with noise.

Chapter 06

The $20 Wake-Up Call

Somewhere in the middle of all this, the API bill hit $20.

One session. A few hours of active work — downloading models, running tests, spawning subagents, iterating. $20.

Not catastrophic. But a number. A real, visible number in the Anthropic console with a timestamp.

Here's the thing about switching from a flat-rate subscription to API pricing: the subscription felt like electricity. You don't think about it, you just use it. API pricing feels like a taxi meter. You can see it ticking.

$20

Cost of one
exploratory session

5×

Sonnet is cheaper
than Opus per token

Switching to Sonnet buys time. But even that feels like a compromise. Sonnet is great. It's just not Opus. You notice.

That's the moment local stops being a hobby project and becomes infrastructure. Not because the models are ready — they're not, not for full assistant work. But because the math demands it.

Chapter 07

The Honest Verdict

After a full night of testing: local 8B models aren't ready to replace Claude as your main OpenClaw session model.

Not because they're bad. Llama 3.1 8B is genuinely impressive for its size. But none of them handle the complexity of a full assistant system prompt + project context + multi-constraint instructions reliably enough to run your day on. The gap is real. You feel it immediately.

Final Configuration

Sonnet (default) Main sessions. Fast, capable, cost-managed.
Qwen3 8B (backup) Ollama, always warm, zero marginal cost. Alias: qwen.
Claude Cowork Heavy lifting — code gen, deep research, complex tasks.
Llama + Gemma Removed. No current use case that justifies keeping them.

Where Local Models Actually Make Sense Right Now

✓ Works well

Short-prompt cron tasks

Heartbeat checks, health monitors, simple data extraction. Small context = no truncation problems.

✓ Works well

Pre-processing

Summarize a long document before it hits the main session. Classify. Extract fields. Filter noise.

✓ Works well

High-frequency automation

If a task runs 100x/day, the per-call cost math changes fast. Local wins at volume.

✗ Not ready

Full assistant sessions

12K token context, multi-constraint instructions, complex reasoning, persona adherence — the 8B tier isn't there yet.

✗ Not ready

27B+ on 24GB

Model fits, KV cache doesn't. OOM at ~6K tokens. Need 48GB+ for 27B with full context.

Appendix

Replicate It

Setup Summary

Hardware Mac mini M4, 24GB unified memory

Inference Ollama (brew install ollama)

Model qwen3:8b (keep as backup)

Context fix Custom modelfile: PARAMETER num_ctx 16384

Platform OpenClaw (openclaw.ai)

# Install Ollama
brew install ollama
ollama serve &

# Pull a model (always specify size!)
ollama pull qwen3:8b

# Fix the context window — don't skip this
echo "FROM qwen3:8b
PARAMETER num_ctx 16384" > /tmp/Modelfile
ollama create qwen3-16k -f /tmp/Modelfile

# Make it survive reboots (macOS launchd)
# Add plist to ~/Library/LaunchAgents/ pointing to: ollama serve

Final thoughts

The Bigger Picture

This whole exercise is a microcosm of what's happening across the AI tooling space. Cloud models are getting more capable and more restrictive simultaneously. Local models are closing the gap faster than anyone expected. The hardware — Apple Silicon especially — is already in people's homes.

The playbook is becoming: cloud for heavy lifting, local for everything else. Heartbeat checks, monitoring crons, quick-turn responses, routine tasks — none of that needs to burn API credits. It just needs good enough. And good enough is now 8GB and an Ollama install away.

We're not there for full-session assistant work. But the trajectory is obvious, and it's moving monthly, not yearly.

The infrastructure is built. The models will catch up. Until then — Sonnet for the real work, Qwen for the rest.