Streaming LLMs: Expanding Language Models with Attention Sinks

Krishna Pullakandam
3 min read · Oct 24, 2023

Feeding large language models an unlimited amount of data has been a persistent challenge in artificial intelligence. While we all want to give these models more knowledge, we often hit roadblocks that slow down performance or cause out-of-memory errors. These hurdles arise from two fundamental limits: GPU memory and the computation time needed to process long inputs. However, a recent research project introduces a fascinating concept called the “attention sink,” offering a unique solution to this conundrum.

The Challenge of Data Overflow: As we venture deeper into the world of language models, we quickly discover that the more data we feed into a model, the slower it runs. Beyond a certain point, it runs out of memory and fails with an error. This problem is hard to solve for two key reasons.

  1. GPU Memory Limitations: The first challenge stems from the finite memory of GPUs, the devices that power the training and inference of language models. In the Transformer architecture used by most modern large language models, every token attends to every other token, so the cost of attention grows quadratically with the length of the input. This means we cannot endlessly extend the data fed into the model.
  2. Computation Time: The second challenge relates to the time required to process extensive data. Even if we could fit an enormous dataset into GPU memory regardless of cost, the computation time would be impractical and would significantly impact the user experience.
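To make the two limits concrete, here is a rough back-of-the-envelope sketch. The model dimensions (32 layers, 32 heads, head size 128, 2 bytes per value) are illustrative assumptions and do not describe any particular model; the function names are hypothetical.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one head's seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_heads: int = 32,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Approximate key/value cache size for a single sequence.

    The defaults are assumed, illustrative dimensions.
    """
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value


for n in (1_000, 2_000, 4_000):
    mib = kv_cache_bytes(n) / 2**20
    print(f"{n} tokens: {attention_score_entries(n)} score entries, "
          f"{mib:.0f} MiB of cache")
```

Under these assumptions, doubling the input doubles the cache but quadruples the number of attention scores, which is why both memory and time become a problem as inputs grow.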

Window Attention: Up to this point, one common solution to this problem has been “window attention.” Instead of attempting to process the entire input, the model keeps only a fixed-size window of the most recent tokens. While this approach keeps memory use and speed under control, it has an inherent limitation. The model loses all context about the tokens that have been removed, so it cannot recall earlier parts of the conversation.
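As a minimal sketch of the idea, the window can be modeled as a fixed-size queue of token ids. This is a simplification: a real model caches per-layer key and value tensors rather than raw tokens, and `window_cache` is a hypothetical helper name.

```python
from collections import deque


def window_cache(tokens, window_size):
    """Return the tokens still visible after streaming through `tokens`.

    A deque with maxlen silently drops the oldest item once it is full,
    which mirrors how window attention forgets early tokens.
    """
    cache = deque(maxlen=window_size)
    for token in tokens:
        cache.append(token)
    return list(cache)


# Stream ten token ids through a window of four.
print(window_cache(list(range(10)), window_size=4))
```

Only the four most recent ids remain; everything earlier, including the very first tokens, is gone.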

The Advent of Attention Sinks: Let us dive into an exciting approach to significantly increase the amount of data that a large language model can handle as input while maintaining high efficiency. The key concept behind this innovation is the “attention sink”.

The attention sink describes a phenomenon: even when we feed thousands of tokens to a large language model, it gives disproportionately high attention weight to the first few tokens compared with those that come later. Performance also degrades toward the later part of a long input. This discovery is leveraged to extend how much input the model can handle.

How Attention Sinks Work: In practice, as the amount of data expands, the tokens in the middle are evicted from memory, while the initial tokens are always kept as attention sinks. To maintain context, a rolling cache holds the most recent tokens. The model’s access to these initial tokens, combined with the recent conversation held in the rolling cache, enhances its contextual understanding.

This approach effectively allows large language models to have access to a broader context without exhausting memory resources.
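The eviction policy described above can be sketched in a few lines. This is a toy illustration that tracks token ids only, not the research project's actual implementation; the class name `SinkCache` and the parameters `n_sink` and `window_size` are assumptions chosen for this example.

```python
from collections import deque


class SinkCache:
    """Keep the first `n_sink` tokens forever plus a rolling recent window."""

    def __init__(self, n_sink=4, window_size=8):
        self.n_sink = n_sink
        self.sinks = []                           # initial tokens, never evicted
        self.recent = deque(maxlen=window_size)   # rolling cache of latest tokens

    def append(self, token):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(token)
        else:
            # Once the deque is full, the oldest "middle" token is dropped.
            self.recent.append(token)

    def visible(self):
        """Tokens the model could still attend to."""
        return self.sinks + list(self.recent)


cache = SinkCache(n_sink=2, window_size=3)
for token in range(10):
    cache.append(token)
print(cache.visible())
```

With two sink slots and a window of three, streaming ids 0 through 9 leaves the first two and the last three visible, while ids 2 through 6 in the middle are gone. Memory stays bounded at `n_sink + window_size` entries no matter how long the stream runs.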

The Possibilities and Limitations:

While this innovation is exciting, it doesn’t eliminate context limitations. However, it opens the door to new possibilities for various applications:

  1. Long-Form Content Generation: For tasks like writing entire books, movie scripts, or series of blog posts, where a large amount of content needs to be generated, this approach can work well, because generation can keep going without hitting a hard context limit.
  2. Extended Memory for Chatbots: If you’re building a chatbot that needs to keep a conversation running for months, this mechanism can help maintain context efficiently, as long as what it needs is among the initial or most recent tokens.

However, it’s important to note that this doesn’t mean you can feed a language model an extensive collection of research papers and expect a detailed summary. Tokens in the middle of the input are still dropped, so the model cannot draw on them.

The Road Ahead: The introduction of attention sinks is just the first step in addressing context limitations in large language models. The concept opens the door to creative solutions, and there may be more innovative ideas on the horizon.

As we continue to explore the fascinating world of artificial intelligence, let’s keep the conversation going. Share your thoughts and new ideas in the comments. There’s a bright future ahead, and together, we can unlock the full potential of language models.



Krishna Pullakandam

Content writer and AI enthusiast. I love to write about technology, business, and culture.