
Baidu’s ERNIE-4.5-VL-28B-A3B-Thinking Marks a Significant Step in Open Multimodal AI


Baidu has released ERNIE-4.5-VL-28B-A3B-Thinking, a multimodal model designed to interpret visual, technical, and textual data within a unified framework. Its capabilities, including fine-grained image understanding, diagram reasoning, and tool-assisted interaction, are detailed in Baidu’s official documentation and public model card (Baidu).


Built on a Mixture-of-Experts (MoE) architecture, the model contains 28 billion parameters in total but activates only about 3 billion during inference. This design reduces computational load and enables more efficient deployment, a point emphasized in both Baidu’s model documentation and industry coverage (VentureBeat).


ERNIE-4.5-VL-28B-A3B-Thinking offers advanced visual reasoning features such as automatic zooming for small-text extraction and structured outputs including bounding boxes and object coordinates. It can also call external tools such as image search, positioning it closer to an agentic multimodal system than to a standard vision-language model (Baidu).
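For readers who want a feel for how such a grounded query might look in code, the short Python sketch below loads the public checkpoint and asks for bounding boxes through the Hugging Face Transformers library. It is an illustrative sketch only: the use of the generic AutoProcessor/AutoModelForCausalLM classes, the trust_remote_code flag, the example image file, and the prompt wording are assumptions made here rather than Baidu’s documented interface, and the model card remains the authoritative reference for actual usage.

# Illustrative sketch only: assumes the checkpoint works with the generic
# Transformers Auto classes via trust_remote_code; consult Baidu's model card
# for the supported classes, chat template, and output schema.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # 16-bit weights; see the memory note further down
    device_map="auto",           # spread layers across the available GPU(s)
    trust_remote_code=True,
)

image = Image.open("chart.png")  # hypothetical local image
prompt = "Locate every axis label in this chart and return bounding boxes."  # illustrative prompt

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])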


According to Baidu’s published benchmarks, the model achieved:


  • 87.1 on ChartQA

  • 82.5 on MathVista

  • 77.3 on VLMs Are Blind

Visual performance benchmarks comparing Baidu’s ERNIE against Gemini 2.5 Pro and GPT-5 High. Image courtesy: Baidu Inc.

These benchmark scores appear in Baidu’s official materials and have been reproduced in external reporting, though they are vendor-published and have not yet been independently verified. Analysts note that while the results suggest strong visual reasoning performance, enterprises should validate real-world capability on domain-specific datasets (Artificial Intelligence News).


Baidu’s deployment notes indicate that the model typically requires high-memory hardware, with an 80GB GPU cited as an example configuration in its documentation.
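A quick back-of-the-envelope check (our arithmetic, not Baidu’s) is consistent with that figure: all 28 billion parameters must be resident in memory even though only about 3 billion are active per inference step, so 16-bit weights alone occupy roughly 56 GB, leaving limited headroom on an 80GB card for activations and the KV cache.

# Rough memory estimate, illustrative only; real requirements depend on precision,
# quantization, sequence length, batch size, and the serving framework.
total_params = 28e9        # total parameters reported for the MoE model
bytes_per_param = 2        # bf16 / fp16
weights_gb = total_params * bytes_per_param / 1e9
print(f"Approximate weight memory at 16-bit precision: {weights_gb:.0f} GB")  # ~56 GB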

ERNIE-4.5-VL-28B-A3B-Thinking is distributed under the Apache 2.0 licence and is publicly accessible through major model hubs, reinforcing Baidu’s emerging position as a competitive player in open multimodal AI (VentureBeat, AI Studio, Hugging Face).


References


Baidu. (2025). ERNIE-4.5-VL-28B-A3B-Thinking [Model card]. Baidu AI Studio. https://aistudio.baidu.com/modelsdetail/39280/intro

Baidu. (2025). ERNIE-4.5-VL-28B-A3B-Thinking [Model hub page]. Hugging Face. https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking

Daws, R. (2025, November 12). Baidu ERNIE multimodal AI beats GPT and Gemini in benchmarks. Artificial Intelligence News. https://www.artificialintelligence-news.com/news/baidu-ernie-multimodal-ai-gpt-and-gemini-benchmarks/

Nuñez, M. (2025, November 12). Baidu just dropped an open-source multimodal AI that it claims beats GPT-5 and Gemini. VentureBeat. https://venturebeat.com/ai/baidu-just-dropped-an-open-source-multimodal-ai-that-it-claims-beats-gpt-5/

Ort, C. (2025, November 12). Baidu ERNIE Benchmarks: Challenging GPT-4o and Gemini. i10x. https://i10x.ai/news/baidu-ernie-benchmark-push


Wheeler, K. (2025, March 17). Baidu’s ERNIE 4.5 & X1: Redefining AI with multimodal power. Technology Magazine. https://technologymagazine.com/articles/baidus-ernie-4-5-x1-redefining-ai-with-multimodal-power



