New 1.5B router model achieves 93% accuracy without costly retraining

Researchers at Katanemo Labs have introduced Arch-Router, a new routing model and framework designed to intelligently map user queries to the most appropriate large language model (LLM).

For enterprises building products that rely on multiple LLMs, Arch-Router aims to solve a key challenge: how to direct each query to the best model for the job without resorting to rigid logic or costly retraining every time something changes.

LLM Routing Challenges

As the number of LLMs grows, developers are moving from single-model setups to multi-model systems that use the unique strengths of each model for specific tasks, such as code generation, text summarization, or image editing.

LLM routing has emerged as a key technique for building and deploying these systems, acting as a traffic controller that directs each user query to the most appropriate model.

Existing routing methods generally fall into two categories: “task-based routing,” where queries are routed based on predefined tasks, and “performance-based routing,” which seeks an optimal balance between cost and performance.

However, task-based routing struggles with unclear or shifting user intent, especially in multi-turn conversations. Performance-based routing, meanwhile, rigidly prioritizes benchmark scores, often ignores real-world user preferences, and adapts poorly to new models without costly fine-tuning.

More fundamentally, as the Katanemo Labs researchers point out in their paper, “Existing routing approaches have limited practical use. They optimize for benchmark performance while ignoring human preferences driven by subjective evaluation criteria.”

The researchers emphasize the need for routing systems that “provide transparency, align with subjective human preferences, and can be easily adapted as models and use cases evolve.”

A new framework for preference-aligned routing

To address these limitations, the researchers propose a “preference-aligned routing” framework that matches queries to routing policies based on user-defined preferences.

In this framework, users define routing policies in natural language using a “Domain-Action Taxonomy”: a two-level hierarchy that mirrors how people naturally describe tasks, starting with a general topic (a domain such as “legal” or “finance”) and narrowing to a specific task (an action such as “summarization” or “code generation”).

Each of these policies is linked to a preferred model, allowing developers to make routing decisions based on their actual needs, rather than just benchmark scores. As the paper states, “This taxonomy serves as a mental model that helps users define clear and structured routing policies.”
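
To make this concrete, here is a minimal sketch of what a set of Domain-Action policies might look like in code. The schema, policy names, and model identifiers below are illustrative assumptions, not Katanemo’s actual configuration format:

```python
# Hypothetical routing policies following the Domain-Action Taxonomy.
# Each policy pairs a natural-language description with a preferred model.
# The schema and names are illustrative, not Katanemo's actual format.
ROUTING_POLICIES = [
    {
        "name": "legal_summarization",
        "description": "Summarize legal documents such as contracts or filings.",
        "model": "claude-3-7-sonnet",
    },
    {
        "name": "code_generation",
        "description": "Write new code from a natural-language specification.",
        "model": "gpt-4o",
    },
    {
        "name": "image_editing",
        "description": "Edit, transform, or annotate images.",
        "model": "gemini-2.5-pro",
    },
]
```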

The routing process happens in two stages. First, a preference-aligned router model takes the user query and the full set of policy descriptions and selects the most appropriate policy. Second, a mapping function connects that selected policy to its designated LLM.

Because the model selection logic is separated from the policies, models can be added, removed, or swapped simply by editing the routing policies, without retraining or modifying the router itself. This separation provides the flexibility needed for practical deployments, where models and use cases are constantly evolving.
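
In code, that separation might look like the sketch below, reusing the hypothetical `ROUTING_POLICIES` schema from above. The `select_policy` callable stands in for the router model call; only the policy-to-model lookup changes when a model is swapped:

```python
from typing import Callable

def route(query: str, policies: list[dict],
          select_policy: Callable[[str, list[dict]], str]) -> str:
    """Two-stage routing sketch: a router model picks a policy, then a
    plain lookup maps that policy to an LLM."""
    # Stage 1: the router model (e.g., Arch-Router) sees the query plus
    # every policy description and returns one policy identifier.
    policy_name = select_policy(query, policies)

    # Stage 2: deterministic policy-to-model lookup. Swapping a model
    # means editing this mapping; the router itself is untouched.
    policy_to_model = {p["name"]: p["model"] for p in policies}
    return policy_to_model[policy_name]
```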

Preference-aligned routing framework (source: arXiv)

Policy selection is handled by Arch-Router, a compact 1.5B-parameter language model fine-tuned for preference-aligned routing. Arch-Router receives the user query and the complete set of policy descriptions in its prompt, then generates the identifier of the best-matching policy.

Because the policies are part of the input, the system can adapt to new or modified routes at inference time through in-context learning, without retraining. This generative approach allows Arch-Router to use its pre-trained knowledge to understand the semantics of both the query and the policies, and to process the entire conversation history at once.
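
A rough sketch of how such a prompt might be assembled is shown below; the template wording is an assumption, and the paper’s actual prompt format may differ:

```python
def build_router_prompt(policies: list[dict], conversation: list[str]) -> str:
    """Embed every policy description in the router's prompt, so adding or
    editing a route changes only the input text, not the model weights."""
    policy_lines = "\n".join(
        f"- {p['name']}: {p['description']}" for p in policies
    )
    history = "\n".join(conversation)
    return (
        "You are a routing model. Given the conversation below, reply with "
        "the single policy name that best matches the user's request.\n\n"
        f"Policies:\n{policy_lines}\n\n"
        f"Conversation:\n{history}\n\nPolicy:"
    )
```

Because the expected output is just a short identifier, generation cost stays small even as the policy list grows, which is central to the latency point below.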

A common concern with including lengthy policies in a prompt is added latency, but the researchers designed Arch-Router to be highly efficient. “While routing policies can get long, we can easily increase the context window of Arch-Router with minimal impact on latency,” explains Salman Paracha, founder and CEO of Katanemo Labs. He notes that latency is driven primarily by the length of the output, and for Arch-Router the output is simply the short name of a routing policy, such as “image_editing” or “document_creation.”

Arch-Router in action

To build Arch-Router, the researchers fine-tuned a 1.5B-parameter version of the Qwen 2.5 model on a curated dataset of 43,000 examples. They then tested its performance against state-of-the-art proprietary models from OpenAI, Anthropic, and Google on four public datasets designed to evaluate conversational AI systems.

The results show that Arch-Router achieves the highest overall routing score of 93.17%, surpassing all other models, including top proprietary ones, by an average of 7.71%. The model’s advantage grows in longer conversations, demonstrating a strong ability to track context across multiple turns.

Arch-Router vs. other models (source: arXiv)

In practice, according to Paracha, this approach is already being applied in several scenarios. For example, in open-source coding tools, developers use Arch-Router to direct different stages of their workflow, such as “code design,” “code understanding,” and “code generation,” to the LLM best suited for each task. Similarly, enterprises can route document-creation requests to a model like Claude 3.7 Sonnet while sending image-editing tasks to Gemini 2.5 Pro.

The system is also well suited for personal assistants that span many domains, where user tasks range from text summarization to factoid queries.

The framework is integrated with Arch, Katanemo Labs’ AI-native proxy server for agents, which lets developers implement sophisticated traffic-shaping rules. For example, when integrating a new LLM, a team can send a small portion of the traffic for a specific routing policy to the new model, verify its performance with internal metrics, and then fully shift traffic with confidence. The company is also working to integrate its tools with evaluation platforms to further streamline this process for enterprise developers.
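
A minimal sketch of that kind of traffic split, assuming a simple per-request weighted choice (the 10% canary share and model names are hypothetical):

```python
import random

# Hypothetical canary rollout: for one routing policy, send a small share
# of traffic to a candidate model while the incumbent serves the rest.
CANARY_SHARE = 0.10  # illustrative 10% split, not a recommended value

INCUMBENT = {"document_creation": "claude-3-7-sonnet"}
CANDIDATE = {"document_creation": "new-model-under-test"}

def pick_model(policy_name: str) -> str:
    """Return the model for a policy, diverting a fraction to the canary."""
    if policy_name in CANDIDATE and random.random() < CANARY_SHARE:
        return CANDIDATE[policy_name]  # verify with internal metrics
    return INCUMBENT[policy_name]
```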

Ultimately, the goal is to move beyond siloed AI implementations. “Arch-Router, and Arch more broadly, helps developers and enterprises move from fragmented LLM implementations to a unified, policy-driven system,” says Paracha. “In scenarios where user tasks are diverse, our framework helps turn that task and LLM fragmentation into a unified experience, making the final product feel seamless to end users.”
