For years, transformers powered nearly every major AI breakthrough. But now, a tiny startup might have just cracked a wall that’s been blocking innovation — and they say it only cost them $4,000.
Let’s get into it.
—
From Transformers to Brumby: A Quick Recap
Back in 2017, Google introduced the now-famous “Attention Is All You Need” paper. That phrase turned into dogma. Every big-name language model — GPT, Claude, Llama, Gemini — has leaned on attention as its core mechanism ever since.
Attention lets models look across their entire input and focus on what matters. It’s powerful, but not cheap. The longer your input (say, analyzing pages of code or hours of video), the more memory and compute it devours — and it grows quadratically. That’s not sustainable, especially as we move to bigger, longer contexts.
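To make that scaling concrete, here's a rough back-of-the-envelope sketch in plain Python (illustrative only, not tied to any particular model): the attention score matrix has one entry per pair of tokens, so doubling the context roughly quadruples the work.

```python
# Rough illustration: attention builds an n x n score matrix (per head, per layer),
# so the number of pairwise token comparisons grows quadratically with context length n.
def attention_comparisons(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:,} tokens -> {attention_comparisons(n):,} comparisons")

# 1,000 tokens -> 1,000,000 comparisons
# 10,000 tokens -> 100,000,000 comparisons
# 100,000 tokens -> 10,000,000,000 comparisons
```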
Now, in 2025, a relatively unknown startup called Manifest AI has done something quietly radical.
They ditched attention altogether.
—
The Brumby-14B Breakaway
On October 28, Manifest AI launched Brumby-14B-Base, a retrained variant of the open-source transformer Qwen3-14B.
But here’s the twist: Brumby removed all attention layers and replaced them with something entirely new — a mechanism they call Power Retention.
This isn’t just a small tweak. It’s a full-on shift in how the model understands and processes information.
And it worked.
—
What Is Power Retention — And Why Does It Matter?
In a standard transformer, attention grabs every word (or token), compares it to every other word, and figures out what to pay attention to. It’s like holding an all-hands meeting every time someone says something. It’s thorough, but exhausting.
Power Retention works more like an ongoing conversation.
Instead of starting from scratch every time, it remembers what was already said and updates its internal state as it goes — like a running summary. This makes it much closer to how recurrent neural networks (RNNs) used to work, but with a modern twist.
How it works:
- Uses familiar transformer inputs (queries, keys, values)
- Replaces full sequence comparison with a recurrent state update
- Stores information in a memory matrix (S) updated over time
- Keeps per-token compute constant, no matter how long the input gets
So whether the input is 1,000 words or a million tokens, it costs the same to process each one. That’s huge for long documents or streaming data.
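To see why the per-token cost stays flat, here's a minimal sketch in NumPy. This is not Manifest's actual Power Retention formula (their mechanism is more elaborate); it's the simplest possible linear recurrence with the same shape as the list above: a fixed-size memory matrix S that absorbs each key/value pair and is read with the query.

```python
import numpy as np

d = 64  # head dimension (illustrative)

def recurrent_step(S, k, v, q):
    """One token step: fold the new key/value into the state, then read it with the query.

    S has a fixed shape (d, d), so the work per token is constant
    no matter how many tokens came before.
    """
    S = S + np.outer(k, v)   # write: accumulate the key/value association
    out = q @ S              # read: query the running summary
    return S, out

# Process a stream of tokens one at a time.
rng = np.random.default_rng(0)
S = np.zeros((d, d))
for _ in range(1_000):       # could just as well be a million tokens
    q, k, v = rng.standard_normal((3, d))
    S, out = recurrent_step(S, k, v, q)

print(S.shape, out.shape)    # the state stays (64, 64); each output is (64,)
```

Whether that loop runs for a thousand steps or a million, each iteration only touches the (d, d) state, which is the constant per-token cost described above.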
And it gets even better.
—
Performance That Holds Its Own
You might expect a radical design swap like this to tank performance. But Brumby-14B holds up shockingly well:
- On math reasoning (MATH): Brumby scored 0.62 — significantly higher than Qwen3’s 0.54 and far above others like GLM-4.5-Air (0.47)
- On the popular GSM8K dataset: Brumby slightly edges out Qwen3 (0.88 vs. 0.84)
- It does lag behind on some knowledge-heavy tasks (like MMLU-Pro), but stays near parity on most others
Overall, Brumby stands toe-to-toe with heavyweight transformer models on difficult benchmarks — despite not using attention at all.
—
Only 60 Hours. Only $4,000.
This might be the part that made your jaw drop — and rightly so.
Brumby-14B was trained in just 60 hours on 32 Nvidia H100 GPUs. That’s roughly $4,000 in compute cost.
Let’s put that in perspective. Training a model of this size typically costs hundreds of thousands of dollars, and often millions.
How did they pull it off?
They didn’t start from scratch.
Instead, they took Qwen3’s pretrained weights, removed its attention layers, and retrofitted it with Power Retention. Think of it like handing a trained pianist a guitar: they know music theory, but they need to relearn how to play.
That retraining process took just 3,000 steps — a mini crash course, not an undergrad degree.
And it worked. Brumby gradually rebuilt its performance with its new “instrument” and reached similar levels to Qwen3.
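To make that "layer surgery" more concrete, here's a minimal PyTorch sketch of the idea. This is not Manifest's actual code: the RetentionLayer below is a naive stand-in, and matching submodules by the substring "attn" is a hypothetical naming convention. But it shows the principle: the pretrained weights stay put, and only the attention blocks are swapped out and left to relearn.

```python
import torch
from torch import nn

class RetentionLayer(nn.Module):
    """Naive stand-in for an attention replacement: a fixed-size state updated per token."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, seq, hidden)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        S = x.new_zeros(x.size(0), x.size(-1), x.size(-1))    # state: (batch, hidden, hidden)
        outs = []
        for t in range(x.size(1)):                             # constant work per token
            S = S + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(-2)  # accumulate k_t v_t^T
            outs.append(torch.einsum("bh,bhd->bd", q[:, t], S))    # read the state with q_t
        return torch.stack(outs, dim=1)

def swap_attention_for_retention(model: nn.Module, hidden_size: int) -> nn.Module:
    """Replace every submodule whose name suggests attention; keep all other weights as-is."""
    for name, child in model.named_children():
        if "attn" in name.lower():                 # hypothetical naming convention
            setattr(model, name, RetentionLayer(hidden_size))
        else:
            swap_attention_for_retention(child, hidden_size)
    return model
```

Since Power Retention consumes the same queries, keys, and values as attention (per the list earlier), the swapped-in layers presumably start from a familiar interface, which helps explain why only a short retraining run was needed.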
—
Hardware-Friendly and Ready to Scale
Another huge bonus of Power Retention? It plays nice with hardware.
Thanks to its linear compute cost, Brumby performs particularly well on long contexts and large sequences. Manifest AI’s own CUDA framework, Vidrial, pushes this further, with claimed speedups of up to 100× over traditional attention implementations in some settings.
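One way to make the hardware argument concrete is memory: a transformer's KV cache grows with every token it has seen, while a fixed-size recurrent state does not. The numbers below are rough and illustrative (made-up dimensions, plain multi-head attention with no grouped-query tricks), not Brumby's actual footprint.

```python
# Rough memory comparison (illustrative dimensions, fp16 = 2 bytes per value).
layers, heads, head_dim, dtype_bytes = 40, 40, 128, 2

def kv_cache_bytes(context_len: int) -> int:
    # Keys + values, cached for every layer, head, and token seen so far.
    # Real models often shrink this with grouped-query attention, but the linear growth remains.
    return 2 * layers * heads * head_dim * context_len * dtype_bytes

def retention_state_bytes() -> int:
    # One (head_dim x head_dim) state matrix per layer and head, regardless of context length.
    return layers * heads * head_dim * head_dim * dtype_bytes

for n in (4_096, 131_072, 1_000_000):
    print(f"{n:>9,} tokens: KV cache {kv_cache_bytes(n) / 1e9:7.1f} GB"
          f" vs. fixed state {retention_state_bytes() / 1e9:.2f} GB")

#     4,096 tokens: KV cache     3.4 GB vs. fixed state 0.05 GB
#   131,072 tokens: KV cache   107.4 GB vs. fixed state 0.05 GB
# 1,000,000 tokens: KV cache   819.2 GB vs. fixed state 0.05 GB
```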
They’re still polishing production support, but early numbers are promising:
- Hardware utilization: 80–85% in early tests
- FLOPs and memory usage: Much lower than attention-based models
- Easier multi-user inference and context-parallel training
And while Brumby-14B is only 14 billion parameters, Manifest says the approach scales: they’re projecting retraining costs of $10,000–$20,000 for future 700-billion-parameter models.
—
It’s Easy to Try (If You’re Already Training Models)
Here’s the kicker: converting an existing model to Brumby’s architecture is shockingly simple.
If you’ve already been fine-tuning an open-source model, here’s what you’d need to do:
- Run pip install retention
- Change one line in your architecture code
- Resume training
That’s it.
After a few hours of GPU time, you’d likely see the model regain its original performance — now with way better long-context handling and lower compute needs.
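For the "resume training" step, here's a minimal sketch of what that short run could look like, assuming a Hugging Face-style causal LM whose attention layers have already been swapped out. The function arguments are placeholders, not a real API, and the hyperparameters are illustrative.

```python
import torch
from torch import nn

def resume_training(converted_model: nn.Module, training_batches, steps: int = 3_000):
    """Short retraining run after the attention-to-retention swap.

    Assumes a Hugging Face-style interface (forward returns an object with a .loss
    when labels are passed); converted_model and training_batches are placeholders.
    """
    optimizer = torch.optim.AdamW(converted_model.parameters(), lr=1e-5)
    converted_model.train()
    for step, batch in enumerate(training_batches):
        out = converted_model(input_ids=batch["input_ids"], labels=batch["labels"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step + 1 >= steps:   # a short adaptation run, not a full pretraining
            break
    return converted_model
```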
—
So Is This the End of Attention?
Not quite.
Even Jacob Buckman, Manifest AI’s founder, says: “The end of the transformer era is not yet here. But the march has begun.”
And while some critics have called out the “$4,000 foundation model” claim as misleading — noting that Brumby reused a lot of work from Qwen3 — the overall takeaway still holds up:
A new architecture. Competitive performance. A fraction of the cost.
That cracks open doors for smaller labs, startups, and even solo researchers to build serious models without ballooning budgets.
—
Final Thoughts
If attention was the first wave, Power Retention might just be the second.
Brumby-14B proves that we’re not locked into one design forever. By injecting fresh architectural ideas without starting from scratch, Manifest AI might’ve lit the spark that brings back genuine innovation to model building.
And who knows — maybe the models of the future won’t need to pay attention at all.