THE ATTENTION DISTANCE PROBLEM: THEORETICAL ANALYSIS OF LONG-RANGE DEPENDENCIES IN TRANSFORMERS
DOI: https://doi.org/10.46121/pspc.52.2.13
Abstract
Transformer-based language models have reshaped natural language processing, but modeling long-range dependencies remains a significant challenge for them. This paper presents a theoretical investigation of what we term the "attention distance problem": the fundamental limitations of transformer architectures in capturing relationships between tokens separated by large distances in a sequence. We formalize this problem mathematically, analyze how attention mechanisms degrade as token distance increases, and establish theoretical bounds on information retention across layers. Our analysis reveals that the effectiveness of attention decays at different rates depending on architectural choices, which has direct implications for model design. We introduce analytical models of signal propagation through transformers and identify key bottlenecks in multi-head attention. Building on this theoretical foundation, we propose principled approaches to improving long-range modeling without increasing computational complexity. Our findings provide a unified framework for understanding existing approaches to extending context length in transformers and offer insights into the design of more efficient architectures for long-context language modeling.

