Japanese AI lab Sakana AI has introduced a new technique that allows multiple large language models (LLMs) to collaborate on a single task, effectively creating a “dream team” of AI agents. The method, called Multi-LLM AB-MCTS, enables models to perform trial and error and combine their unique strengths to solve problems that are too complex for any individual model.
For enterprises, this approach provides a means to develop more robust and capable AI systems. Instead of being locked into a single provider or model, companies can dynamically leverage the best aspects of different frontier models, assigning the right AI to the right part of a task to achieve superior results.
The power of collective intelligence
Frontier AI models are evolving rapidly. However, each model has its own distinct strengths and weaknesses derived from its training data and architecture. One model may excel at coding, while another excels at creative writing. The researchers at Sakana AI argue that these differences are a feature, not a bug.
“We see these biases and varied aptitudes not as limitations, but as valuable resources for creating collective intelligence,” the researchers write in their blog post. They believe that just as humanity’s greatest achievements come from diverse teams, AI systems can achieve more by working together. “By pooling their intelligence, AI systems can solve problems that are insurmountable for any single model.”
Thinking longer at inference time
Sakana AI’s new algorithm is an “inference-time scaling” technique (also referred to as “test-time scaling”), an area of research that has become very popular in the past year. While most of the focus in AI has been on “training-time scaling” (making models bigger and training them on larger datasets), inference-time scaling improves performance by allocating more computational resources after a model has already been trained.
One common approach is to use reinforcement learning to encourage models to generate longer, more detailed chain-of-thought (CoT) sequences, as seen in popular models such as OpenAI o3 and DeepSeek-R1. Another, simpler method is repeated sampling, where the model is given the same prompt multiple times to generate a variety of potential solutions, similar to a brainstorming session. Sakana AI’s work combines and advances these ideas.
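For reference, plain repeated sampling (best-of-N) can be sketched in a few lines. In this minimal Python sketch, `generate` and `score` are placeholder stand-ins for an LLM call and a task-specific evaluator, not part of any real API:

```python
import random

def generate(prompt: str) -> str:
    # Placeholder for an LLM call; returns a random candidate string here.
    return f"candidate-{random.randint(0, 9)} for: {prompt}"

def score(answer: str) -> float:
    # Placeholder for a task-specific evaluator (unit tests, a verifier, a reward model).
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample n independent answers to the same prompt and keep the highest-scoring one.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("Sort a list without using the built-in sort."))
```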
“Our framework offers a smarter, more strategic version of Best-of-N (aka repeated sampling),” Takuya Akiba, research scientist at Sakana AI and co-author of the paper, told VentureBeat. “It complements reasoning techniques like long CoT through RL. By dynamically selecting the search strategy and the appropriate LLM, this approach maximizes performance within a limited number of LLM calls, delivering better results on complex tasks.”
How adaptive branching search works
The core of the new method is an algorithm called Adaptive Branching Monte Carlo Tree Search (AB-MCTS). It enables an LLM to effectively perform trial and error by intelligently balancing two different search strategies: “searching deeper” and “searching wider.” Searching deeper involves taking a promising answer and repeatedly refining it, while searching wider means generating completely new solutions from scratch. AB-MCTS combines these approaches, allowing the system not only to improve a good idea but also to pivot and try something new if it hits a dead end or discovers another promising direction.
To achieve this, the system uses Monte Carlo Tree Search (MCTS), a decision-making algorithm famously used by DeepMind’s AlphaGo. At each step, AB-MCTS uses a probabilistic model to decide whether it is more valuable to refine an existing solution or to generate a new one.
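As a simplified illustration of that decision (not Sakana AI’s actual implementation), one plausible probabilistic model is Thompson sampling over Beta posteriors: sample a value for refining each existing answer and one for generating a fresh answer, then act on whichever draw looks best:

```python
import random

class Node:
    # One candidate solution in the search tree.
    def __init__(self):
        self.wins, self.losses = 1, 1  # Beta(1, 1) prior over "refining this helps"

    def sample(self) -> float:
        return random.betavariate(self.wins, self.losses)

def choose_action(nodes: list[Node], new_node_prior: Node) -> str:
    # Draw a plausible value for refining each existing answer ("go deeper")...
    draws = [(node.sample(), f"refine answer {i}") for i, node in enumerate(nodes)]
    # ...and one for branching out with a brand-new answer ("go wider").
    draws.append((new_node_prior.sample(), "generate new answer"))
    return max(draws)[1]

nodes = [Node(), Node()]
nodes[0].wins += 5                   # pretend answer 0 has scored well so far
print(choose_action(nodes, Node()))  # usually refines answer 0, sometimes explores
```

Each observed outcome updates the corresponding posterior, so the search gradually concentrates on branches that keep paying off while never fully abandoning exploration.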
The researchers took this a step further with Multi-LLM AB-MCTS, which not only decides “what” to do (refine vs. generate) but also “which” LLM should do it. At the start of a task, the system does not know which model is best suited to the problem. It begins by trying a balanced mix of the available LLMs and, as it progresses, learns which models are more effective, allocating more of the workload to them over time.
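The model-selection layer can be pictured the same way, with each LLM treated as a bandit arm. The sketch below is an illustrative assumption rather than Sakana AI’s released code; only the model names come from the article:

```python
import random

# Beta posterior [successes, failures] per model, starting from a uniform prior.
posteriors = {name: [1, 1] for name in ("o4-mini", "Gemini 2.5 Pro", "DeepSeek-R1")}

def pick_model() -> str:
    # Thompson sampling: draw from each model's posterior and take the best draw.
    return max(posteriors, key=lambda m: random.betavariate(*posteriors[m]))

def record(model: str, solved: bool) -> None:
    posteriors[model][0 if solved else 1] += 1

for _ in range(20):
    model = pick_model()
    solved = random.random() < 0.5   # stand-in for actually running the task
    record(model, solved)

print({m: tuple(p) for m, p in posteriors.items()})
```

Early on, the draws are nearly uniform, so every model gets tried; as evidence accumulates, the model with the best track record wins most draws and receives most of the calls.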
Putting the AI ‘dream team’ to the test
The researchers tested their Multi-LLM AB-MCTS system on the ARC-AGI-2 benchmark. ARC (Abstraction and Reasoning Corpus) is designed to test a human-like ability to solve novel visual reasoning problems, making it notoriously difficult for AI.
The team used a combination of frontier models, including o4-mini, Gemini 2.5 Pro, and DeepSeek-R1.
The collective of models was able to find correct solutions for over 30% of the 120 test problems, a score that significantly outperformed any of the models operating alone. The system demonstrated the ability to dynamically assign the best model for a given problem. On tasks where a clear path to a solution existed, the algorithm quickly identified the most effective LLM and used it more frequently.

More impressively, the team observed instances where the models solved problems that were previously impossible for any single one of them. In one case, a solution generated by the o4-mini model was incorrect. However, the system passed this flawed attempt to DeepSeek-R1 and Gemini 2.5 Pro, which were able to analyze it, correct the error, and ultimately produce the right answer.
“This demonstrates that Multi-LLM AB-MCTS can flexibly combine frontier models to solve previously unsolvable problems, pushing the limits of what is achievable by using LLMs as a collective intelligence,” the researchers write.

“In addition to each model’s individual strengths and weaknesses, the tendency to hallucinate can vary significantly among them,” Akiba said. “By creating an ensemble with a model that is less likely to hallucinate, it could be possible to achieve the best of both worlds: powerful reasoning and strong grounding. Since hallucination is a major issue in business contexts, this approach could be valuable for its mitigation.”
From research to real-world applications
To help developers and businesses apply this technique, Sakana AI has released the underlying algorithm as an open-source framework called TreeQuest, available under the Apache 2.0 license (usable for commercial purposes). TreeQuest provides a flexible API, allowing users to implement Multi-LLM AB-MCTS for their own tasks with custom scoring and logic.
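As a rough illustration only, a framework of this shape wires custom generation and scoring callbacks into a search loop. The names below (`TreeSearch`, `run`, the callback signatures) are invented for this sketch and are not TreeQuest’s actual API; consult the repository for the real interface:

```python
from dataclasses import dataclass, field

@dataclass
class TreeSearch:
    generate: callable            # produces a new candidate, optionally from a parent
    score: callable               # maps a candidate to a float (higher is better)
    candidates: list = field(default_factory=list)

    def run(self, budget: int):
        # Spend a fixed budget of generation calls, refining the best candidate so far.
        for _ in range(budget):
            parent = max(self.candidates, key=self.score) if self.candidates else None
            self.candidates.append(self.generate(parent))
        return max(self.candidates, key=self.score)

search = TreeSearch(
    generate=lambda parent: (parent or "") + "x",  # stand-in for an LLM call
    score=lambda c: len(c) / 10,                   # stand-in for custom scoring logic
)
print(search.run(budget=5))
```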
“While we are in the early stages of applying AB-MCTS to specific business-oriented problems, our research reveals significant potential in several areas,” Akiba said.
Beyond the ARC-AGI-2 benchmark, the team was also able to successfully apply AB-MCTS to tasks such as complex algorithmic coding and improving the accuracy of machine learning models.
“AB-MCTS could also be highly effective for problems that require iterative trial and error, such as optimizing performance metrics of existing software,” Akiba said. “For example, it could be used to automatically find ways to improve the response latency of a web service.”
The release of a practical, open-source tool could pave the way for a new class of more powerful and reliable enterprise AI applications.