AI assistants can remember details from conversations that took place a month ago and get the facts straight, without the company spending a fortune on computing resources. Researchers from Sberbank’s Practical Artificial Intelligence Center, the AIRI Institute and Skoltech have found a way to determine the exact moment when a neural network begins to lose meaning while conserving memory. The researchers presented their discovery in a paper at the EACL 2026 international conference in Morocco. The team used the new method to update the GigaChat AI assistant, which now remembers key user data and uses it in future conversations to make communication more natural and personalized.
Contemporary AI models can reason in remarkably human-like ways, but they can also be forgetful. The standard approach is to feed the entire history of a conversation, or a stack of documents, into the neural network. This works well with short texts, but once the context becomes too long, the responses may fall short of expectations: the model remembers the beginning and the end, but not what happened in the middle. Scientists call this the ‘Lost in the Middle’ effect. Furthermore, processing millions of tokens requires a significant amount of GPU memory.
An effective method of processing large volumes of data is to compress the information and convert it into vector representations. While this approach minimizes the consumption of computational resources, exceeding compression thresholds can result in irreversible data distortion, a phenomenon known as token overflow.
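The idea of compressing a long context into vector representations can be illustrated with a minimal sketch. The mean-pooling scheme and all names below are simplifications for illustration, not the method from the paper: a long sequence of token embeddings is pooled into a fixed, much smaller set of vectors.

```python
import numpy as np

def compress(token_embeddings: np.ndarray, n_slots: int) -> np.ndarray:
    # Split the token sequence into n_slots contiguous chunks and
    # average each chunk into a single vector. A real compressor would
    # be learned; pooling just shows the shape of the trade-off.
    chunks = np.array_split(token_embeddings, n_slots)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])

rng = np.random.default_rng(1)
tokens = rng.normal(size=(1024, 64))      # 1024 token embeddings, dim 64
compressed = compress(tokens, n_slots=16) # shape (16, 64)

print(compressed.shape)
print(f"compression ratio: {tokens.shape[0] // compressed.shape[0]}x")
```

The tighter the compression ratio, the less GPU memory the context consumes, but past some threshold the pooled vectors can no longer encode all the facts they replace.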
Russian researchers have proposed a tool to address this issue. They developed a lightweight, trainable classifier that functions as a quality detector. The classifier checks the compressed tokens before they are fed into the large language model. If the classifier detects distortion in the meaning, it prevents the corrupted context from proceeding any further. The system can then substitute the original uncompressed text or search for new documents. Either way, incorrect data won’t make it into the generation process, users won’t receive inaccurate responses, and companies won’t waste resources on unnecessary computations.
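The gating logic described above can be sketched as follows. This is a hypothetical illustration, not the researchers’ actual classifier: the weights are random stand-ins for a trained model, and the names (`CompressionQualityGate`, `select_context`) are invented for the example.

```python
import numpy as np

RNG = np.random.default_rng(0)

class CompressionQualityGate:
    """Lightweight classifier that scores compressed context vectors
    before they are fed into the large language model."""

    def __init__(self, dim: int, threshold: float = 0.5):
        # A trained gate would load learned weights; random values
        # stand in for them in this sketch.
        self.w = RNG.normal(size=dim)
        self.b = 0.0
        self.threshold = threshold

    def score(self, compressed_vectors: np.ndarray) -> float:
        # Mean-pool the compressed vectors, then apply a logistic
        # head: an estimate of p(meaning preserved).
        pooled = compressed_vectors.mean(axis=0)
        logit = float(pooled @ self.w + self.b)
        return 1.0 / (1.0 + np.exp(-logit))

    def select_context(self, compressed_vectors: np.ndarray,
                       original_text: str):
        # If the gate suspects token overflow (distorted meaning),
        # fall back to the original uncompressed text instead of
        # letting corrupted context reach the generator.
        if self.score(compressed_vectors) >= self.threshold:
            return "compressed", compressed_vectors
        return "original", original_text

gate = CompressionQualityGate(dim=16)
vectors = RNG.normal(size=(8, 16))  # 8 compressed context vectors
kind, context = gate.select_context(vectors, "full conversation history")
print(kind)
```

Either branch keeps distorted data out of generation: the model receives the compressed context only when the gate judges it faithful.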
“In retrieval-augmented generation systems, the key challenge lies in compressing long contexts while understanding at what point compression begins to destroy the information necessary for a response. In our study, we propose a method for identifying this threshold before running a large language model. This saves computational resources and makes RAG systems much more robust. The model only receives the context necessary for a correct response,” says Professor Alexander Panchenko, who heads the Natural Language Processing Laboratory at the Skoltech AI Center.
For businesses that integrate AI models into their search engines, assistants, or support bots, this results in direct savings. The study provides scientists with a rigorous methodology that clearly defines the boundaries of text compression and demonstrates how to detect token overflows. This increases public trust in the technology and the popularity of AI assistants. For instance, researchers used their method to update Sber’s GigaChat Ultra, teaching it to remember user information and facilitating more natural and comfortable communication.
Nikolay Tiden, director of Sberbank’s Practical Artificial Intelligence Center, comments:
“Imagine having to summarize the results of a three-hour strategic meeting in just a few sentences. While you’ll highlight the key points, there’s a high risk of overlooking critical details or distorting the meaning. Context compression in language models works similarly. Our solution acts as a defense mechanism that automatically detects when brevity begins to compromise the model’s stability. For businesses, this means achieving a new level of reliable AI with lower costs and more accurate decision-making.”