The entire AI landscape shifted in January 2025, when then-unknown Chinese AI startup DeepSeek (a subsidiary of Hong Kong-based quantitative analysis firm High-Flyer Capital Management) unveiled its powerful open-source language reasoning model DeepSeek R1 to the world, besting giants like Meta.
As DeepSeek's usage spread rapidly among researchers and businesses, Meta reportedly went into panic mode upon learning that this new R1 model had been trained for just a fraction of the cost of many other leading models, models that cost many millions of dollars more, yet still outperformed them.
Meta's generative AI strategy had until then been predicated on releasing best-in-class open-source models under its "Llama" brand for researchers and companies to build on freely (provided they have fewer than 700 million monthly users; beyond that threshold, they must contact Meta about special paid licensing terms).
But DeepSeek R1's surprisingly strong performance on a far smaller budget reportedly shook the company's leadership and forced a reckoning of sorts: the last version of Llama, 3.3, had been released just a month earlier, in December 2024, yet already looked outdated.
Now we know the fruits of that reckoning: today, Meta founder and CEO Mark Zuckerberg took to his Instagram account to announce a new Llama 4 series of models, two of which, the 400-billion-parameter Llama 4 Maverick and 109-billion-parameter Llama 4 Scout, are available now for developers to download and begin using or fine-tuning from llama.com and the AI code-sharing community Hugging Face.
A massive 2-trillion-parameter Llama 4 Behemoth is also being previewed today. Meta's blog post about the release says it is still being trained and gives no indication of when it might ship. (Recall that "parameters" refers to the internal settings that govern a model's behavior, and that more parameters generally means a more powerful and complex model.)
One headline feature of these models is that they are all multimodal: they are trained on, and can therefore receive and generate, text, video, and imagery (audio is not mentioned).
Another is their very long context windows: 1 million tokens for Llama 4 Maverick and 10 million for Llama 4 Scout, corresponding to roughly 1,500 and 15,000 pages of text, respectively. That means a user could theoretically upload or paste up to 7,500 pages' worth of text and receive that much back in a single interaction with Llama 4 Scout.
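Those page figures follow from a common rule of thumb, roughly 0.75 words per token and about 500 words per page. Both constants are assumptions that vary with tokenizer and formatting, but the arithmetic can be sketched as:

```python
# Rough tokens-to-pages conversion.
# Both constants are rule-of-thumb assumptions, not Meta's figures.
WORDS_PER_TOKEN = 0.75  # typical English tokenization
WORDS_PER_PAGE = 500    # standard manuscript page

def tokens_to_pages(tokens: int) -> int:
    return round(tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE)

print(tokens_to_pages(1_000_000))   # Llama 4 Maverick's window: ~1,500 pages
print(tokens_to_pages(10_000_000))  # Llama 4 Scout's window: ~15,000 pages
```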
Here’s what we’ve learned about this release so far:
All-in on mixture-of-experts
All three models use the "mixture-of-experts" (MoE) architectural approach popularized in earlier model releases from OpenAI and Mistral, which essentially combines multiple smaller models ("experts") specialized in different tasks, subjects, and media formats into a unified, larger model. Each Llama 4 release is thus said to be a mixture of 128 different experts, and more efficient to run, because only the expert needed for a particular task, plus a "shared" expert, handles each token, rather than the entire model running for every one.
As the Llama 4 blog post notes:
As a result, while all parameters are stored in memory, only a subset of the total parameters are activated while serving these models. This improves inference efficiency by lowering model serving costs and latency: Llama 4 Maverick can be run on a single [Nvidia] H100 DGX host for easy deployment, or with distributed inference for maximum efficiency.
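To illustrate the idea the blog post describes, here is a minimal toy sketch of MoE routing: a router scores the experts for each token, and only the top-scoring expert plus an always-on shared expert actually run. Everything here (the expert count, the scalar "experts," the scores) is an illustrative stand-in, not Meta's implementation:

```python
# Toy mixture-of-experts (MoE) routing sketch: per token, a router picks
# one specialized expert, which runs alongside an always-on shared expert.
# All weights and experts are hypothetical stand-ins for illustration.
import math

NUM_EXPERTS = 4  # Llama 4 reportedly uses 128; 4 keeps the toy readable

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, experts, shared_expert, token):
    """Run only the top-1 routed expert plus the shared expert."""
    probs = softmax(router_scores)
    top = max(range(len(probs)), key=probs.__getitem__)
    # Output combines the routed expert (weighted by its gate) and the shared expert
    return probs[top] * experts[top](token) + shared_expert(token)

# Toy experts: scalar functions standing in for feed-forward blocks
experts = [lambda x, k=k: (k + 1) * x for k in range(NUM_EXPERTS)]
shared = lambda x: 0.5 * x

out = route_token([0.1, 2.0, 0.3, 0.2], experts, shared, token=1.0)
```

At Llama 4's scale the same principle applies with 128 experts and transformer feed-forward blocks in place of these toy functions, which is why serving cost tracks active parameters rather than total parameters.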
While both Scout and Maverick are publicly available for self-hosting, Meta has announced no hosted API or pricing tiers for its official infrastructure. Instead, it is focusing on distribution through open downloads and integration with Meta AI in WhatsApp, Messenger, Instagram, and on the web.
Meta estimates the inference cost of Llama 4 Maverick at $0.19 to $0.49 per 1 million tokens (using a 3:1 blend of input and output). That makes it substantially cheaper than proprietary models like GPT-4o, which is estimated to cost $4.38 per million tokens, based on community benchmarks.
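The 3:1 blend is simply a weighted average of separate input-token and output-token prices. A quick sketch (the individual prices below are hypothetical placeholders, since Meta quoted only the blended range):

```python
# Blended per-million-token cost at a 3:1 input:output ratio.
# The separate input/output prices used below are illustrative placeholders,
# chosen so the blend lands inside Meta's quoted $0.19-$0.49 range.
def blended_cost(input_price, output_price, input_ratio=3, output_ratio=1):
    """Prices in $ per 1M tokens; returns the weighted average."""
    total = input_ratio + output_ratio
    return (input_price * input_ratio + output_price * output_ratio) / total

print(round(blended_cost(0.11, 0.43), 2))  # hypothetical prices -> 0.19
```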
All three Llama 4 models, especially Maverick and Behemoth, are explicitly designed for reasoning, coding, and step-by-step problem solving, though they do not appear to exhibit the chains of thought of dedicated reasoning models such as OpenAI's "o" series or DeepSeek R1.
Instead, they seem designed to compete more directly with "classic," non-reasoning LLMs and multimodal models such as OpenAI's GPT-4o and DeepSeek's V3, with the exception of Llama 4 Behemoth, which does appear to threaten DeepSeek R1 (more on this below!).
In addition, for Llama 4, Meta built custom post-training pipelines focused on enhancing reasoning, such as:
- Removing more than 50% of "easy" prompts during supervised fine-tuning.
- Adopting a continuous reinforcement learning loop with progressively harder prompts.
- Using pass@k evaluations and curriculum sampling to strengthen performance in math, logic, and coding.
- Implementing MetaP, a new technique that lets engineers tune hyperparameters (such as per-layer learning rates) on one model and apply them to other model sizes and token types while preserving the intended model behavior.
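Pass@k, mentioned above, is a standard code-and-math evaluation metric: the probability that at least one of k sampled solutions is correct. A common unbiased estimator, given n generated samples of which c pass (this is the formulation popularized by OpenAI's HumanEval work, assumed here rather than anything Meta published), looks like:

```python
# Unbiased pass@k estimator: probability that at least one of k samples,
# drawn from n generated solutions (c of them correct), passes the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k draw must contain at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # reduces to c/n = 0.3 when k = 1
```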
MetaP is of particular interest because it could be used going forward to set hyperparameters on one model and then derive many other model types from it, increasing training efficiency.
My VentureBeat colleague and LLM expert Ben Dickson noted that the new MetaP technique is especially important for a massive training run like Behemoth's, which uses 32K GPUs and FP8 precision, achieving 390 TFLOPs per GPU over more than 30 trillion tokens.
In other words, researchers can tell a model broadly how they want it to behave, then apply that to both larger and smaller versions of the model, and across different forms of media.
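Meta has not published MetaP's internals, so the following is only a speculative sketch of the general idea of width-aware hyperparameter transfer (in the spirit of the published muP/muTransfer line of work): per-layer learning rates tuned on a small proxy model are rescaled for a larger target. The scaling rule and all names here are assumptions:

```python
# Hypothetical sketch of width-aware hyperparameter transfer (muP-style),
# illustrating the *idea* behind MetaP; Meta's actual rules are unpublished.
def transfer_layer_lrs(proxy_lrs, proxy_width, target_width):
    """Rescale per-layer learning rates tuned on a narrow proxy model.

    Assumption: hidden-layer learning rates scale inversely with width
    (as in muP); this is an illustrative rule, not Meta's documented method.
    """
    scale = proxy_width / target_width
    return {layer: lr * scale for layer, lr in proxy_lrs.items()}

# Learning rates tuned on a 256-wide proxy, transferred to a 4096-wide target
proxy = {"attn": 3.2e-3, "mlp": 1.6e-3}
target = transfer_layer_lrs(proxy, proxy_width=256, target_width=4096)
```

The payoff of any such scheme is that expensive hyperparameter sweeps run on the cheap proxy, not on the full-scale model.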
A powerful, but not yet the most powerful, model family
In his announcement video on Instagram (a Meta subsidiary, naturally), Meta CEO Mark Zuckerberg said the company's goal is to build the world's leading AI, open source it, and make it universally accessible so that everyone in the world benefits.
That is a clearly carefully worded statement, as is Meta's blog post calling Llama 4 Scout "the best multimodal model in the world *in its class* and more powerful than all previous generation Llama models" (emphasis added by me).
In other words, these are very powerful models, near the top of the heap compared to others in their parameter-size class, but not necessarily setting new performance records. Even so, Meta was keen to trumpet the models its new Llama 4 family beats, among them:
Llama 4 Behemoth
- Outperforms GPT-4.5, Gemini 2.0 Pro, and Claude Sonnet 3.7 on:
- MATH-500 (95.0)
- GPQA Diamond (73.7)
- MMLU Pro (82.2)
Llama 4 Maverick
- Beats GPT-4o and Gemini 2.0 Flash on most multimodal reasoning benchmarks:
- ChartQA, DocVQA, MathVista, MMMU
- Competitive with DeepSeek v3.1 (45.8B params) while using less than half the active parameters (17B)
- Benchmark scores:
- ChartQA: 90.0 (vs. GPT-4o's 85.7)
- DocVQA: 94.4 (vs. 92.8)
- MMLU Pro: 80.5
- Cost-effective: $0.19 to $0.49 per million tokens

Llama 4 Scout
- Matches or outperforms models such as Mistral 3.1, Gemini 2.0 Flash-Lite, and Gemma 3 on:
- DOCVQA: 94.4
- MMLU Pro: 74.3
- Mathvista: 70.7
- Unparalleled 10M-token context length, ideal for long documents, codebases, or multi-turn analysis
- Designed for efficient deployment on a single H100 GPU
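The single-H100 claim can be sanity-checked with back-of-envelope weight memory, assuming Scout's roughly 109-billion total parameters (per Meta's announcement) and int4 quantization. This counts only weights and ignores KV cache and activations, so it is a lower bound:

```python
# Back-of-envelope weight memory for Llama 4 Scout on one 80 GB H100.
# Assumes ~109B total parameters; ignores KV cache and activation memory.
H100_MEMORY_GB = 80
SCOUT_PARAMS = 109e9
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(n_params: float, dtype: str) -> float:
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp16", "int8", "int4"):
    gb = weight_gb(SCOUT_PARAMS, dtype)
    print(f"{dtype}: {gb:.1f} GB -> fits on one H100: {gb <= H100_MEMORY_GB}")
```

Only the int4 weights (~54.5 GB) fit under an 80 GB H100's memory; fp16 (~218 GB) would require several GPUs, which is why low-bit quantization matters for single-GPU deployment.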

But how does Llama 4 stack up against DeepSeek, after all?
Of course, there is a whole other class of heavy-duty reasoning models such as DeepSeek R1, OpenAI's "o" series (like o1), Gemini 2.0, and Claude Sonnet 3.7.
Comparing the highest-parameter benchmarked model, Llama 4 Behemoth, against the initial DeepSeek R1 release chart for the R1-32B and OpenAI o1 models, here's how Llama 4 Behemoth stacks up:
| Benchmark | Llama 4 Behemoth | DeepSeek R1 | OpenAI o1-1217 |
|---|---|---|---|
| MATH-500 | 95.0 | 97.3 | 96.4 |
| GPQA Diamond | 73.7 | 71.5 | 75.7 |
| MMLU | 82.2 | 90.8 | 91.8 |
What can you conclude?
- MATH-500: Llama 4 Behemoth trails slightly behind DeepSeek R1 and OpenAI o1.
- GPQA Diamond: Behemoth is ahead of DeepSeek R1, but behind OpenAI o1.
- MMLU: Behemoth trails both, but still outperforms Gemini 2.0 Pro and GPT-4.5.
Takeaway: while DeepSeek R1 and OpenAI o1 edge out Behemoth on a couple of metrics, Llama 4 Behemoth remains extremely competitive, performing at or near the top of the reasoning leaderboard in its class.
Safety and political “bias”
Meta also emphasized model alignment and safety, introducing tools such as Llama Guard, Prompt Guard, and CyberSecEval to help developers detect unsafe inputs/outputs or adversarial prompts, and implementing Generative Offensive Agent Testing (GOAT) for automated red-teaming.
The company also claims Llama 4 shows substantial improvement on "political bias," saying that "specifically, [leading LLMs] historically have leaned left when it comes to debated political and social topics," and that Llama 4 does better at courting the right wing… in keeping with Zuckerberg's embrace of Republican U.S. President Donald J. Trump and his party after the 2024 election.
Where Llama 4 stands so far
Meta's Llama 4 models bring together efficiency, openness, and strong performance across multimodal and reasoning tasks.
With Scout and Maverick now openly available and Behemoth previewed as a state-of-the-art teacher model, the Llama ecosystem offers a competitive open alternative to top-tier proprietary models from OpenAI, Anthropic, DeepSeek, and Google.
Whether you're building an enterprise-scale assistant, an AI research pipeline, or a long-context analysis tool, Llama 4 offers flexible, high-performance options with a clear orientation toward reasoning-first design.