Very small language models (SLMs) can outperform leading large language models (LLMs) in reasoning tasks, according to new research from the Shanghai AI Research Institute. The authors show that, with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can outperform a 405B-parameter LLM on complex math benchmarks.
The ability to deploy SLMs for complex reasoning tasks is valuable as enterprises look for new ways to use these models in different environments and applications.
Test time scaling explained
Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use "internal TTS": they are trained to "think" slowly by generating a long string of chain-of-thought (CoT) tokens.
An alternative is "external TTS," where, as the name suggests, model performance is enhanced with outside help. External TTS is suitable for repurposing existing models for reasoning tasks without further fine-tuning. An external TTS setup usually consists of a "policy model," the main LLM that generates the answer, and a process reward model (PRM) that evaluates the policy model's answers. These two components are coupled through a sampling or search method.
The simplest setup is "best-of-N": the policy model generates multiple answers, and the PRM selects one or more of the best answers to compose the final response. More advanced external TTS methods use search. In "beam search," the model breaks the answer down into multiple steps.
For each step, multiple candidates are sampled and run through the PRM. One or more of the most promising candidates are then selected and used to generate the next step of the answer. In "diverse verifier tree search" (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before synthesizing them into a final answer.
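To make the policy-model-plus-PRM setup concrete, here is a minimal Python sketch of best-of-N sampling. The function names (`policy_generate`, `prm_score`) and the stubbed implementations are hypothetical placeholders, not code from the paper; a real setup would call an actual LLM and a trained PRM.

```python
import random  # stand-in for real model calls in this sketch


def policy_generate(question: str) -> str:
    """Sample one candidate answer from the policy model (stubbed here)."""
    return f"candidate answer {random.randint(0, 9999)} to: {question}"


def prm_score(question: str, answer: str) -> float:
    """Score a candidate answer with the process reward model (stubbed here)."""
    return random.random()


def best_of_n(question: str, n: int = 8) -> str:
    """Best-of-N: sample N full answers, keep the one the PRM rates highest."""
    candidates = [policy_generate(question) for _ in range(n)]
    return max(candidates, key=lambda ans: prm_score(question, ans))


if __name__ == "__main__":
    print(best_of_n("What is 17 * 24?", n=8))
```

Beam search and DVTS follow the same pattern, but score and prune partial answers step by step instead of only ranking complete responses.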
What is the right scaling strategy?
Choosing the right TTS strategy depends on multiple factors. The study authors conducted a systematic investigation into how different policy models and PRMs influence the efficiency of TTS methods.
Their findings show that efficiency heavily depends on both the policy model and the PRM. For example, with small policy models, search-based methods outperform best-of-N. But with large policy models, best-of-N is more effective, because these models have stronger reasoning capabilities and don't need a reward model to verify every step of their reasoning.
Their findings also show that the right TTS strategy depends on the difficulty of the problem. For example, for small policy models with fewer than 7B parameters, best-of-N works better on easy problems, while beam search works better on harder problems. For policy models between 7B and 32B parameters, diverse verifier tree search performs well on easy and medium problems, while beam search works best for hard problems. But for large policy models (72B parameters and up), best-of-N is the optimal method across all difficulty levels.
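These rules of thumb can be summarized in a small helper function. The function below is an illustration distilled from the paragraph above; the name, the size thresholds, and the difficulty labels are assumptions for the sketch, not an official recipe from the paper.

```python
def choose_tts_strategy(policy_params_b: float, difficulty: str) -> str:
    """Pick a TTS method from policy model size (in billions of parameters)
    and problem difficulty ("easy", "medium", or "hard")."""
    if policy_params_b < 7:
        # Small models: best-of-N for easy problems, beam search otherwise.
        return "best-of-n" if difficulty == "easy" else "beam-search"
    if policy_params_b <= 32:
        # Mid-size models: DVTS for easy/medium problems, beam search for hard ones.
        return "beam-search" if difficulty == "hard" else "dvts"
    # Large models (e.g., 72B) reason well enough that best-of-N wins everywhere.
    return "best-of-n"


# Example: a 3B policy model on a hard problem -> "beam-search"
print(choose_tts_strategy(3, "hard"))
```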
Why Small Models Beat Big Models

Based on these findings, developers can create compute-optimal TTS strategies that take into account the policy model, the PRM, and problem difficulty to get the most out of their compute budget when solving reasoning problems.
For example, the researchers found that a Llama-3.2-3B model with the compute-optimal TTS strategy outperforms Llama-3.1-405B on MATH-500 and AIME24, two complex math benchmarks. This shows that an SLM can outperform a model 135 times larger when using the compute-optimal TTS strategy.
In other experiments, they found that a Qwen2.5 model with 500 million parameters can outperform GPT-4o with the right compute-optimal TTS strategy. Using the same strategy, the 1.5B-parameter distilled version of DeepSeek-R1 outperformed o1-preview and o1-mini on MATH-500 and AIME24.
When accounting for both training and inference compute budgets, the findings show that with compute-optimal scaling strategies, SLMs can outperform larger models with 100-1,000X fewer FLOPS.
The researchers' results show that compute-optimal TTS significantly improves the reasoning capabilities of language models. However, as the policy model grows larger, the improvement from TTS gradually diminishes.
“This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model,” the researchers write. “Specifically, for models with weak reasoning abilities, scaling test-time computation leads to a substantial improvement, whereas for models with strong reasoning abilities, the gain is limited.”
The study demonstrates that SLMs can perform better than much larger models when compute-optimal test-time scaling methods are applied. While this work focuses on math benchmarks, the researchers plan to expand their research to other reasoning tasks, such as coding and chemistry.