When it comes to AI and vision, most tools can recognize the basics — faces, cats, street signs. But what if your job needs something with more brains than that? Like reading a complex product manual, extracting data from a scanned financial report, or pulling insights from a chart buried in a grainy PDF?
That’s exactly the kind of heavy lifting that Canadian AI company Cohere is aiming for with its new model, Command A Vision.
And here’s the kicker: it runs on just two GPUs, delivers top-shelf performance, and actually understands the kind of visual data enterprises care most about.
Two GPUs, 112 Billion Parameters, and a Lot of Ambition
At the core of this announcement is Command A Vision — a visual language model (VLM) built for real-world enterprise use. It weighs in at 112 billion parameters, making it huge, yet it’s optimized efficiently enough to run on just two (or fewer) GPUs.
That’s important for businesses that don’t want to burn through resources just to process diagrams and reports. Cohere built Command A Vision on top of its text-based Command A model, adding eyes to the brain, so to speak.
The AI can read and make sense of:
- Graphs and charts
- Diagrams in product manuals
- Scanned documents
- PDFs in multiple languages
- Photographic images useful for tasks like risk detection
It’s especially tuned for OCR (optical character recognition) and retrieval-focused tasks, which makes it a practical fit for companies that rely on visual and document-heavy data.
How It Works Under the Hood
Cohere didn’t just bolt a camera onto a language model and call it a day. They followed a clever blueprint similar to the Llava architecture. Essentially, the system splits each image into smaller tiles, encodes those tiles into “soft vision tokens,” feeds the tokens through the 111B-parameter Command A text tower, and carefully aligns them with the model’s language features.
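In spirit, that tiling-and-projection step looks something like the sketch below. Every name and dimension here is an illustrative placeholder, not Cohere’s actual code or real model sizes — it just shows how tiled vision features can be projected into a text tower’s embedding space as soft vision tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only).
TILES = 4          # tiles cut from one image
PATCHES = 16       # patches per tile from the vision encoder
VISION_DIM = 32    # vision-encoder feature size
TEXT_DIM = 64      # text tower embedding size

def encode_tiles(image_tiles):
    """Stand-in vision encoder: one feature vector per patch, per tile."""
    return rng.normal(size=(len(image_tiles), PATCHES, VISION_DIM))

def project_to_text_space(vision_feats, W):
    """Connector: map vision features into the text tower's embedding
    space, producing 'soft vision tokens' the language model can attend to."""
    return vision_feats.reshape(-1, VISION_DIM) @ W  # (tiles*patches, TEXT_DIM)

tiles = [f"tile_{i}" for i in range(TILES)]          # placeholder tile data
W = rng.normal(size=(VISION_DIM, TEXT_DIM)) * 0.02   # learned during alignment

vision_tokens = project_to_text_space(encode_tiles(tiles), W)
text_tokens = rng.normal(size=(10, TEXT_DIM))        # embedded prompt tokens

# The text tower then consumes vision and text tokens as one sequence.
sequence = np.concatenate([vision_tokens, text_tokens], axis=0)
print(sequence.shape)  # (TILES*PATCHES + 10, TEXT_DIM) -> (74, 64)
```

The key idea is that after projection, image content and text live in the same embedding space, so one transformer can reason over both at once.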
That’s not simple stuff. It took three training stages:
- Vision-language alignment – Getting the model to connect visual features with linguistic meaning
- Supervised fine-tuning (SFT) – Training it on a large set of multimodal instructions
- Reinforcement learning with human feedback (RLHF) – Fine-tuning the model to respond the way people want
The result? Command A Vision doesn’t just “see” — it comprehends.
So How Does It Stack Up?
Cohere ran Command A Vision against some of the usual heavyweight models in nine benchmark tests. These tests included OCRBench, ChartQA, AI2D, and TextVQA — all designed to measure how well models understand and analyze visual data paired with text.
Here’s how it scored:
- Command A Vision: 83.1% (average across benchmarks)
- GPT-4.1: 78.6%
- Llama 4 Maverick: 80.5%
- Mistral Medium 3: 78.3%
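For concreteness, here’s what those averages imply, using nothing but the numbers reported above:

```python
# Reported benchmark averages from Cohere's comparison (percent).
scores = {
    "Command A Vision": 83.1,
    "GPT-4.1": 78.6,
    "Llama 4 Maverick": 80.5,
    "Mistral Medium 3": 78.3,
}

best_rival = max(v for k, v in scores.items() if k != "Command A Vision")
margin = scores["Command A Vision"] - best_rival
print(f"Lead over closest rival: {margin:.1f} points")  # 2.6 points
```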
That’s a solid lead, especially when you consider the hardware cost. Unlike some competitors, Command A Vision doesn’t need a server room of GPUs to get the job done. Just two should be enough.
Also worth noting: the model supports at least 23 languages and retains the text comprehension abilities of its predecessor. So even if your documents are in different scripts or mixed-language formats, it still gets the job done.
Open Weights, Lower Costs
One of the interesting angles here is Cohere’s decision to release Command A Vision with open weights. Basically, enterprises don’t have to be locked into proprietary platforms to use it. That’s a smart move, especially as more businesses look for alternatives they can fully own and customize.
Throw in the reduced GPU requirement and you’re looking at a significant reduction in total cost of ownership. That could be a big reason why developers and enterprise teams are starting to take notice.
Final Thoughts
AI that’s good at reading pictures isn’t new. But AI that can understand the kind of visual data businesses actually rely on — and run without bleeding the budget dry? That’s worth paying attention to.
If your company deals with lots of technical diagrams, visual documentation, or multilingual PDFs, Command A Vision might be one of the more efficient tools out there right now.
And the fact that it’s already outperforming some of the biggest names in AI? That’s impressive — no hype needed.
Keywords: AI, Cohere, Command A Vision, GPUs, visual language model, enterprise AI, OCR, document recognition, cost reduction