As companies continue to adopt large language models (LLMs) in a variety of applications, one of the key challenges they face is improving the models’ factual knowledge and reducing hallucinations. In a new paper, researchers at Meta AI propose “scalable memory layers,” which could be one of several solutions to this problem.
Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute resources. The architecture is useful for applications that can spare extra memory for factual knowledge but also want the inference speed of nimbler models.
Dense layers and memory layers
Traditional language models use “dense layers” to encode vast amounts of information in their parameters. In dense layers, all parameters are used at their full capacity and most of them are activated at the same time during inference. Dense layers can learn complex functions, but increasing their size requires additional compute and energy resources.
In contrast, for simple factual knowledge, much simpler layers with associative memory architectures would be more efficient and interpretable. This is what memory layers do: they use simple sparse activations and key-value lookup mechanisms to encode and retrieve knowledge. Sparse layers take up more memory than dense layers, but they use only a small portion of their parameters at a time, which makes them much more compute-efficient.
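To make the mechanism concrete, here is a minimal, self-contained PyTorch-style sketch of a sparsely activated key-value memory layer. It is an illustration of the general idea, not Meta’s implementation: the class name, the slot count and the brute-force scoring of every key are simplifications (memory-layer designs typically use tricks such as product-key lookup to avoid scanning the full table).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    """Toy key-value memory layer with sparse activation (illustrative only).

    Each input query is matched against a large table of learnable keys;
    only the top-k best-matching values are retrieved and combined, so just
    a small fraction of the layer's parameters is touched per token.
    """

    def __init__(self, dim: int, num_slots: int = 65_536, top_k: int = 8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.top_k = top_k

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (..., dim). Score every key, then keep only the top-k slots.
        # (A full scan is shown for clarity; product-key lookups avoid it.)
        scores = query @ self.keys.T                      # (..., num_slots)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)          # (..., top_k)
        retrieved = self.values[topk_idx]                 # (..., top_k, dim)
        # Weighted sum of the few retrieved values: cheap to compute,
        # even though the key/value tables themselves are large.
        return (weights.unsqueeze(-1) * retrieved).sum(dim=-2)
```

A dense feed-forward layer with a comparable parameter count would multiply the input by every one of its weights on each forward pass; the memory layer only touches the top-k retrieved rows, which is where the compute savings come from.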
Memory layers have existed for several years, but they are rarely used in modern deep learning architectures, mainly because they are not optimized for current hardware accelerators.
Current frontier LLMs typically use some form of “mixture-of-experts” (MoE) architecture, which uses a mechanism vaguely similar to memory layers. MoE models are composed of many smaller expert components that specialize in specific tasks. At inference time, a routing mechanism determines which experts become active based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts and provides finer control over the parameters that are activated during inference.
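The routing idea can be sketched in a few lines. The snippet below is a deliberately simplified, hypothetical top-1 router, not the gating used in any production MoE model or in PEER; it only shows how a router restricts computation to one expert per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Simplified mixture-of-experts block with top-1 routing (illustrative)."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim). The router scores each expert for every token.
        gate = F.softmax(self.router(x), dim=-1)   # (num_tokens, num_experts)
        best = gate.argmax(dim=-1)                 # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i
            if mask.any():
                # Only the selected expert's parameters are used for these tokens.
                out[mask] = expert(x[mask]) * gate[mask, i].unsqueeze(-1)
        return out
```

The contrast with the memory layer above is that each expert is still a dense sub-network; the sparsity comes from choosing among a handful (or, in PEER, millions) of experts rather than from a key-value lookup.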
Upgrading memory layers
Memory layers are light on compute but heavy on memory, which presents specific challenges for current hardware and software frameworks. In their paper, the Meta researchers propose several modifications that solve these challenges and make it possible to use memory layers at scale.
First, the researchers configured the memory layers for parallelization, distributing them across several GPUs to store millions of key-value pairs without changing other layers in the model. They also implemented a special CUDA kernel for handling high-memory-bandwidth operations. And they developed a parameter-sharing mechanism that supports a single set of memory parameters across multiple memory layers within a model, which means that the keys and values used for lookups are shared across layers.
These modifications make it possible to implement memory layers within LLMs without slowing the model down.
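As a rough illustration of the parameter-sharing idea (building on the toy SimpleMemoryLayer sketched above, and not reflecting Meta’s actual code, its GPU parallelization or its custom CUDA kernels), a single memory module can be constructed once and reused by transformer blocks at different depths, so every block looks up the same keys and values:

```python
import torch
import torch.nn as nn

class BlockWithSharedMemory(nn.Module):
    """Transformer-style block whose feed-forward sublayer is a memory lookup.

    The memory module is passed in rather than created here, so several
    blocks can hold a reference to the *same* keys and values.
    """

    def __init__(self, dim: int, memory: nn.Module, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.memory = memory  # shared object, not a per-block copy

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Memory lookup sits where a dense feed-forward sublayer would be.
        return x + self.memory(self.norm2(x))

# One set of memory parameters, referenced by blocks at two different depths
# (SimpleMemoryLayer is the toy class from the earlier sketch):
shared_memory = SimpleMemoryLayer(dim=512, num_slots=65_536, top_k=8)
block_a = BlockWithSharedMemory(512, shared_memory)
block_b = BlockWithSharedMemory(512, shared_memory)
```

Because both blocks hold a reference to the same module, the key and value tables exist only once in memory; this is one simple way to express the “single set of memory parameters across multiple memory layers” described above.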
“Sparsely activated memory layers complement dense networks well, providing increased capacity for knowledge acquisition while being light on compute,” the researchers write. “They can be scaled efficiently and give practitioners an attractive new direction for trading off memory and compute.”
To test memory layers, the researchers modified Llama models by replacing one or more dense layers with a shared memory layer. They compared the memory-enhanced models against dense LLMs as well as MoE and PEER models on several tasks, including factual question answering, scientific and common-sense world knowledge, and coding.
Their findings show that memory models improve significantly over dense baselines and compete with models that use two to four times more compute. They also match the performance of MoE models with the same compute budget and parameter count. The models’ performance is especially notable on tasks that require factual knowledge. For example, on factual question answering, a memory model with 1.3 billion parameters approaches the performance of Llama-2-7B, which was trained on twice as many tokens and with 10 times as much compute.
Moreover, the researchers found that the benefits of memory models remained consistent as they scaled their experiments from 134 million to 8 billion parameters.
“Given these findings, we strongly advocate that memory layers should be integrated into all next-generation AI architectures,” the researchers write, while adding that there is still a lot of room for improvement. “In particular, we hope that new learning methods can be developed to push the effectiveness of these layers even further, enabling less forgetting, fewer hallucinations and continual learning.”