Very small language models (SLMs) can outperform leading large language models (LLMs) in reasoning tasks, according to new research from the Shanghai AI Research Institute. The authors show that, with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can outperform a 405B-parameter LLM on complex math benchmarks.
The ability to deploy SLMs for complex reasoning tasks is valuable as enterprises look for new ways to use these models in different environments and applications.
Test time scaling explained
Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use "internal TTS": they are trained to "think" slowly by generating a long string of chain-of-thought (CoT) tokens.
An alternative is "external TTS," where, as the name suggests, model performance is enhanced with outside help. External TTS is suitable for repurposing existing models for reasoning tasks without further fine-tuning. An external TTS setup usually consists of a "policy model," the main LLM that generates the answer, and a process reward model (PRM) that evaluates the policy model's answers. These two components are coupled through a sampling or search method.
The simplest setup is "best-of-N": the policy model generates multiple answers, and the PRM selects one or more of the best answers to compose the final response. More advanced external TTS methods use search. In "beam search," the model breaks the answer down into multiple steps.
For each step, multiple candidates are sampled and run through the PRM. One or more of the most promising candidates are then selected and used to generate the next step of the answer. In "diverse verifier tree search" (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before synthesizing them into a final answer.
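To make the policy-model-plus-PRM setup concrete, here is a minimal Python sketch of best-of-N sampling. The function names (`policy_generate`, `prm_score`) and the stubbed implementations are hypothetical placeholders, not code from the paper; a real setup would call an actual LLM and a trained PRM.

```python
import random  # stand-in for real model calls in this sketch


def policy_generate(question: str) -> str:
    """Sample one candidate answer from the policy model (stubbed here)."""
    return f"candidate answer {random.randint(0, 9999)} to: {question}"


def prm_score(question: str, answer: str) -> float:
    """Score a candidate answer with the process reward model (stubbed here)."""
    return random.random()


def best_of_n(question: str, n: int = 8) -> str:
    """Best-of-N: sample N full answers, keep the one the PRM rates highest."""
    candidates = [policy_generate(question) for _ in range(n)]
    return max(candidates, key=lambda ans: prm_score(question, ans))


if __name__ == "__main__":
    print(best_of_n("What is 17 * 24?", n=8))
```

Beam search and DVTS follow the same pattern, but score and prune partial answers step by step instead of only ranking complete responses.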
What is the right scaling strategy?
Choosing the right TTS strategy depends on multiple factors. The study authors conducted a systematic investigation into how different policy models and PRMs influence the efficiency of TTS methods.
Their findings show that efficiency heavily depends on both the policy model and the PRM. For example, with small policy models, search-based methods outperform best-of-N. But with large policy models, best-of-N is more effective, because these models have stronger reasoning capabilities and don't need a reward model to verify every step of their reasoning.
Their findings also show that the right TTS strategy depends on the difficulty of the problem. For example, for small policy models with fewer than 7B parameters, best-of-N works better on easy problems, while beam search works better on harder problems. For policy models between 7B and 32B parameters, diverse verifier tree search performs well on easy and medium problems, while beam search works best for hard problems. But for large policy models (72B parameters and up), best-of-N is the optimal method across all difficulty levels.
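These rules of thumb can be summarized in a small helper function. The function below is an illustration distilled from the paragraph above; the name, the size thresholds, and the difficulty labels are assumptions for the sketch, not an official recipe from the paper.

```python
def choose_tts_strategy(policy_params_b: float, difficulty: str) -> str:
    """Pick a TTS method from policy model size (in billions of parameters)
    and problem difficulty ("easy", "medium", or "hard")."""
    if policy_params_b < 7:
        # Small models: best-of-N for easy problems, beam search otherwise.
        return "best-of-n" if difficulty == "easy" else "beam-search"
    if policy_params_b <= 32:
        # Mid-size models: DVTS for easy/medium problems, beam search for hard ones.
        return "beam-search" if difficulty == "hard" else "dvts"
    # Large models (e.g., 72B) reason well enough that best-of-N wins everywhere.
    return "best-of-n"


# Example: a 3B policy model on a hard problem -> "beam-search"
print(choose_tts_strategy(3, "hard"))
```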
Why Small Models Beat Big Models

Based on these findings, developers can create compute-optimal TTS strategies that take into account the policy model, the PRM, and problem difficulty to get the most out of their compute budget when solving reasoning problems.
For example, the researchers found that a Llama-3.2-3B model with the compute-optimal TTS strategy outperforms Llama-3.1-405B on MATH-500 and AIME24, two complex math benchmarks. This shows that an SLM can outperform a model 135 times larger when using the compute-optimal TTS strategy.
In other experiments, they found that a Qwen2.5 model with 500 million parameters can outperform GPT-4o with the right compute-optimal TTS strategy. Using the same strategy, the 1.5B-parameter distilled version of DeepSeek-R1 outperformed o1-preview and o1-mini on MATH-500 and AIME24.
When accounting for both training and inference compute budgets, the findings show that with compute-optimal scaling strategies, SLMs can outperform larger models with 100-1,000X fewer FLOPS.
The researchers' results show that compute-optimal TTS significantly improves the reasoning capabilities of language models. However, as the policy model grows larger, the improvement from TTS gradually diminishes.
“This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model,” the researchers write. “Specifically, for models with weak reasoning abilities, scaling test-time computation leads to a substantial improvement, whereas for models with strong reasoning abilities, the gain is limited.”
The study demonstrates that SLMs can perform better than much larger models when compute-optimal test-time scaling methods are applied. While this work focuses on math benchmarks, the researchers plan to expand their research to other reasoning tasks, such as coding and chemistry.