For years, transformers powered nearly every major AI breakthrough. But now, a tiny startup might have just cracked a wall that’s been blocking innovation — and they say it only cost them $4,000.
Let’s get into it.
—
From Transformers to Brumby: A Quick Recap
Back in 2017, Google introduced the now-famous “Attention Is All You Need” paper. That phrase turned into dogma. Every big-name language model — GPT, Claude, Llama, Gemini — has leaned on attention as its core mechanism ever since.
Attention lets models look across their entire input and focus on what matters. It’s powerful, but not cheap. The longer your input (say, analyzing pages of code or hours of video), the more memory and compute it devours — and it grows quadratically. That’s not sustainable, especially as we move to bigger, longer contexts.
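To make that scaling concrete, here's a rough back-of-the-envelope sketch in plain Python (illustrative only, not tied to any particular model): the attention score matrix has one entry per pair of tokens, so doubling the context roughly quadruples the work.

```python
# Rough illustration: attention builds an n x n score matrix (per head, per layer),
# so the number of pairwise token comparisons grows quadratically with context length n.
def attention_comparisons(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:,} tokens -> {attention_comparisons(n):,} comparisons")

# 1,000 tokens -> 1,000,000 comparisons
# 10,000 tokens -> 100,000,000 comparisons
# 100,000 tokens -> 10,000,000,000 comparisons
```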
Now, in 2025, a relatively unknown startup called Manifest AI has done something quietly radical.
They ditched attention altogether.
—
The Brumby-14B Breakaway
On October 28, Manifest AI launched Brumby-14B-Base, a retrained variant of the open-source transformer Qwen3-14B.
But here’s the twist: Brumby removed all attention layers and replaced them with something entirely new — a mechanism they call Power Retention.
This isn’t just a small tweak. It’s a full-on shift in how the model understands and processes information.
And it worked.
—
What Is Power Retention — And Why Does It Matter?
In a standard transformer, attention grabs every word (or token), compares it to every other word, and figures out what to pay attention to. It’s like holding an all-hands meeting every time someone says something. It’s thorough, but exhausting.
Power Retention works more like an ongoing conversation.
Instead of starting from scratch every time, it remembers what was already said and updates its internal state as it goes — like a running summary. This makes it much closer to how recurrent neural networks (RNNs) used to work, but with a modern twist.
How it works:
- Uses familiar transformer inputs (queries, keys, values)
- Replaces full sequence comparison with a recurrent state update
- Stores information in a memory matrix (S) updated over time
- Keeps per-token compute constant, no matter how long the input gets
So whether the input is 1,000 words or a million tokens, it costs the same to process each one. That’s huge for long documents or streaming data.
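To see why the per-token cost stays flat, here's a minimal sketch in NumPy. This is not Manifest's actual Power Retention formula (their mechanism is more elaborate); it's the simplest possible linear recurrence with the same shape as the list above: a fixed-size memory matrix S that absorbs each key/value pair and is read with the query.

```python
import numpy as np

d = 64  # head dimension (illustrative)

def recurrent_step(S, k, v, q):
    """One token step: fold the new key/value into the state, then read it with the query.

    S has a fixed shape (d, d), so the work per token is constant
    no matter how many tokens came before.
    """
    S = S + np.outer(k, v)   # write: accumulate the key/value association
    out = q @ S              # read: query the running summary
    return S, out

# Process a stream of tokens one at a time.
rng = np.random.default_rng(0)
S = np.zeros((d, d))
for _ in range(1_000):       # could just as well be a million tokens
    q, k, v = rng.standard_normal((3, d))
    S, out = recurrent_step(S, k, v, q)

print(S.shape, out.shape)    # the state stays (64, 64); each output is (64,)
```

Whether that loop runs for a thousand steps or a million, each iteration only touches the (d, d) state, which is the constant per-token cost described above.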
And it gets even better.
—
Performance That Holds Its Own
You might expect a radical design swap like this to tank performance. But Brumby-14B holds up shockingly well:
- On math reasoning (MATH): Brumby scored 0.62 — significantly higher than Qwen3’s 0.54 and far above others like GLM-4.5-Air (0.47)
- On the popular GSM8K dataset: Brumby slightly edges out Qwen3 (0.88 vs. 0.84)
- It does lag behind on some knowledge-heavy tasks (like MMLU-Pro), but stays near parity on most others
Overall, Brumby stands toe-to-toe with heavyweight transformer models on difficult benchmarks — despite not using attention at all.
—
Only 60 Hours. Only $4,000.
This might be the part that made your jaw drop — and rightly so.
Brumby-14B was trained in just 60 hours on 32 Nvidia H100 GPUs. That’s roughly $4,000 in compute cost.
Let’s put that in perspective. Training a model of this size typically costs hundreds of thousands of dollars, and often millions.
How did they pull it off?
They didn’t start from scratch.
Instead, they took Qwen3’s pretrained weights, removed its attention layers, and retrofitted it with Power Retention. Think of it like handing a trained pianist a guitar: they know music theory, but they need to relearn how to play.
That retraining process took just 3,000 steps — a mini crash course, not an undergrad degree.
And it worked. Brumby gradually rebuilt its performance with its new “instrument” and reached similar levels to Qwen3.
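To make that "layer surgery" more concrete, here's a minimal PyTorch sketch of the idea. This is not Manifest's actual code: the RetentionLayer below is a naive stand-in, and matching submodules by the substring "attn" is a hypothetical naming convention. But it shows the principle: the pretrained weights stay put, and only the attention blocks are swapped out and left to relearn.

```python
import torch
from torch import nn

class RetentionLayer(nn.Module):
    """Naive stand-in for an attention replacement: a fixed-size state updated per token."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, seq, hidden)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        S = x.new_zeros(x.size(0), x.size(-1), x.size(-1))    # state: (batch, hidden, hidden)
        outs = []
        for t in range(x.size(1)):                             # constant work per token
            S = S + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(-2)  # accumulate k_t v_t^T
            outs.append(torch.einsum("bh,bhd->bd", q[:, t], S))    # read the state with q_t
        return torch.stack(outs, dim=1)

def swap_attention_for_retention(model: nn.Module, hidden_size: int) -> nn.Module:
    """Replace every submodule whose name suggests attention; keep all other weights as-is."""
    for name, child in model.named_children():
        if "attn" in name.lower():                 # hypothetical naming convention
            setattr(model, name, RetentionLayer(hidden_size))
        else:
            swap_attention_for_retention(child, hidden_size)
    return model
```

Since Power Retention consumes the same queries, keys, and values as attention (per the list earlier), the swapped-in layers presumably start from a familiar interface, which helps explain why only a short retraining run was needed.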
—
Hardware-Friendly and Ready to Scale
Another huge bonus of Power Retention? It plays nice with hardware.
Thanks to its linear compute cost, Brumby performs particularly well on long contexts and large sequences. Manifest AI’s own CUDA framework, Vidrial, pushes this further, with claimed speedups of up to 100× over traditional attention implementations in some settings.
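One way to make the hardware argument concrete is memory: a transformer's KV cache grows with every token it has seen, while a fixed-size recurrent state does not. The numbers below are rough and illustrative (made-up dimensions, plain multi-head attention with no grouped-query tricks), not Brumby's actual footprint.

```python
# Rough memory comparison (illustrative dimensions, fp16 = 2 bytes per value).
layers, heads, head_dim, dtype_bytes = 40, 40, 128, 2

def kv_cache_bytes(context_len: int) -> int:
    # Keys + values, cached for every layer, head, and token seen so far.
    # Real models often shrink this with grouped-query attention, but the linear growth remains.
    return 2 * layers * heads * head_dim * context_len * dtype_bytes

def retention_state_bytes() -> int:
    # One (head_dim x head_dim) state matrix per layer and head, regardless of context length.
    return layers * heads * head_dim * head_dim * dtype_bytes

for n in (4_096, 131_072, 1_000_000):
    print(f"{n:>9,} tokens: KV cache {kv_cache_bytes(n) / 1e9:7.1f} GB"
          f" vs. fixed state {retention_state_bytes() / 1e9:.2f} GB")

#     4,096 tokens: KV cache     3.4 GB vs. fixed state 0.05 GB
#   131,072 tokens: KV cache   107.4 GB vs. fixed state 0.05 GB
# 1,000,000 tokens: KV cache   819.2 GB vs. fixed state 0.05 GB
```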
They’re still polishing production support, but early numbers are promising:
- Hardware utilization: 80–85% in early tests
- FLOPs and memory usage: Much lower than attention-based models
- Easier multi-user inference and context-parallel training
And while Brumby-14B is only 14 billion parameters, Manifest says the approach scales: they’re projecting retraining costs of $10,000–$20,000 for future 700-billion-parameter models.
—
It’s Easy to Try (If You’re Already Training Models)
Here’s the kicker: converting an existing model to Brumby’s architecture is shockingly simple.
If you’ve already been fine-tuning an open-source model, here’s what you’d need to do:
- Run pip install retention
- Change one line in your architecture code
- Resume training
That’s it.
After a few hours of GPU time, you’d likely see the model regain its original performance — now with way better long-context handling and lower compute needs.
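For the "resume training" step, here's a minimal sketch of what that short run could look like, assuming a Hugging Face-style causal LM whose attention layers have already been swapped out. The function arguments are placeholders, not a real API, and the hyperparameters are illustrative.

```python
import torch
from torch import nn

def resume_training(converted_model: nn.Module, training_batches, steps: int = 3_000):
    """Short retraining run after the attention-to-retention swap.

    Assumes a Hugging Face-style interface (forward returns an object with a .loss
    when labels are passed); converted_model and training_batches are placeholders.
    """
    optimizer = torch.optim.AdamW(converted_model.parameters(), lr=1e-5)
    converted_model.train()
    for step, batch in enumerate(training_batches):
        out = converted_model(input_ids=batch["input_ids"], labels=batch["labels"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step + 1 >= steps:   # a short adaptation run, not a full pretraining
            break
    return converted_model
```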
—
So Is This the End of Attention?
Not quite.
Even Jacob Buckman, Manifest AI’s founder, says: “The end of the transformer era is not yet here. But the march has begun.”
And while some critics have called out the “$4,000 foundation model” claim as misleading — noting that Brumby reused a lot of work from Qwen3 — the overall takeaway still holds up:
A new architecture. Competitive performance. A fraction of the cost.
That cracks open doors for smaller labs, startups, and even solo researchers to build serious models without ballooning budgets.
—
Final Thoughts
If attention was the first wave, Power Retention might just be the second.
Brumby-14B proves that we’re not locked into one design forever. By injecting fresh architectural ideas without starting from scratch, Manifest AI might’ve lit the spark that brings back genuine innovation to model building.
And who knows — maybe the models of the future won’t need to pay attention at all.