Are transformers the architecture that will lead to better reasoning? And what might come next after transformers?
The transformer architecture powers the most popular public and private AI models today. That raises the question of what comes next. Is this the architecture that will lead to better reasoning, and what might follow it? Today, to bake intelligence in, models need large volumes of data, GPU compute power and rare talent, which makes them costly to build and maintain.
AI deployment started small by making simple chatbots more intelligent. Now, startups and enterprises have figured out how to package intelligence in the form of copilots that augment human knowledge and skill. The next natural step is to package capabilities like multi-step workflows, memory and personalization in the form of agents that can handle use cases across multiple functions, including sales and engineering. The expectation is that a simple prompt from a user will enable an agent to classify intent, break the goal down into multiple steps and complete the task, whether that involves internet searches, authentication into multiple tools or learning from repeated past behaviors.
These agents, when applied to consumer use cases, start giving us a sense of a future where everyone can have a personal Jarvis-like agent on their phone that understands them. Want to book a trip to Hawaii, order food from your favorite restaurant, or manage personal finances? A future in which you and I can securely manage these tasks using personalized agents is possible but, from a technological perspective, we are still far from it.
Is transformer architecture the final frontier?
The transformer architecture’s self-attention mechanism allows a model to weigh the importance of each input token against all other tokens in the input sequence simultaneously. This improves a model’s understanding of language and images by capturing long-range dependencies and complex relationships between tokens. However, it also means that computational cost grows quickly with sequence length, so very long sequences (for example, DNA) lead to slow performance and high memory consumption. A few solutions and research approaches to the long-sequence problem include:
Improving transformers on hardware: A promising technique here is FlashAttention. The FlashAttention paper claims that transformer performance can be improved by carefully managing reads and writes across the different levels of fast and slow memory on the GPU. It does this by making the attention algorithm IO-aware, which reduces the number of reads and writes between the GPU’s high bandwidth memory (HBM) and its on-chip static random access memory (SRAM).
Approximate attention: Self-attention mechanisms have O(n^2) complexity, where n is the length of the input sequence. Is there a way to reduce this quadratic computational complexity to linear so that transformers can better handle long sequences? The optimizations here include techniques such as Reformer, Performer, Skyformer and others.
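To make the quadratic cost described above concrete, here is a minimal sketch in plain Python with no external libraries. It is not the method used by Reformer, Performer or Skyformer. Instead it uses a simpler kernel-based linear attention with an elu(x) + 1 feature map, chosen here purely for illustration. All function names are hypothetical. The point is that once softmax is replaced by a kernel, the multiplication can be reordered so that no n-by-n matrix is ever built.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def softmax_attention(q, k, v):
    """Standard self-attention: builds an n x n score matrix, so cost is O(n^2)."""
    d = len(q[0])
    scores = matmul(q, transpose(k))  # n x n
    weights = []
    for row in scores:
        scaled = [s / math.sqrt(d) for s in row]
        top = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)


def feature_map(m):
    """elu(x) + 1, a positive feature map (an assumption for this sketch)."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


def kernel_attention_quadratic(q, k, v):
    """Kernelized attention computed the slow way, through an n x n matrix."""
    fq, fk = feature_map(q), feature_map(k)
    scores = matmul(fq, transpose(fk))  # n x n
    weights = [[s / sum(row) for s in row] for row in scores]
    return matmul(weights, v)


def kernel_attention_linear(q, k, v):
    """Same result, reordered so that only d x d_v summaries are stored."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)  # d x d_v, one pass over the sequence
    k_sum = [sum(col) for col in zip(*fk)]  # length d
    numer = matmul(fq, kv)  # n x d_v
    out = []
    for fq_row, num_row in zip(fq, numer):
        denom = sum(a * b for a, b in zip(fq_row, k_sum))
        out.append([x / denom for x in num_row])
    return out


# Toy inputs: 3 tokens, head dimension 2 (made-up values).
q = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]
k = [[0.7, 0.1], [-0.3, 0.2], [0.0, -0.6]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

slow = kernel_attention_quadratic(q, k, v)
fast = kernel_attention_linear(q, k, v)
max_diff = max(abs(x - y) for rs, rf in zip(slow, fast) for x, y in zip(rs, rf))
print("max difference between the two orderings:", max_diff)
```

The two kernel functions return the same values up to floating-point rounding, but the linear version touches each token a constant number of times. The trade-off is that kernelized attention only approximates softmax attention, so its outputs differ from those of softmax_attention.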
In addition to these optimizations that reduce the complexity of transformers, some alternative models are challenging the dominance of transformers (although it is still early days for most of them):
State space models: These are a class of models related to recurrent (RNN) and convolutional (CNN) neural networks whose computational cost scales linearly or near-linearly with sequence length, which makes them attractive for long sequences.
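The link to both RNNs and CNNs can be illustrated with a small sketch in plain Python. It assumes a discrete linear state space model with a diagonal state matrix, x_t = a * x_(t-1) + b * u_t and y_t = c · x_t. The function names and parameter values are hypothetical and do not come from any particular published model. The same model can be run as a recurrence (RNN-like) or as a convolution with a precomputed kernel (CNN-like), and both give the same output.

```python
def ssm_recurrent(a, b, c, u):
    """Run the model as a recurrence: one fixed-cost state update per input."""
    state = [0.0] * len(a)
    outputs = []
    for u_t in u:
        # Diagonal state update: each state entry evolves independently.
        state = [a_i * s_i + b_i * u_t for a_i, s_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * s_i for c_i, s_i in zip(c, state)))
    return outputs


def ssm_kernel(a, b, c, length):
    """Convolution kernel K[j] = sum_i c_i * a_i**j * b_i."""
    return [
        sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for j in range(length)
    ]


def ssm_convolutional(a, b, c, u):
    """Run the same model as a causal convolution of the input with the kernel.

    This direct sum is quadratic in the sequence length; it is written this way
    only for readability. An FFT-based convolution would be near-linear.
    """
    kernel = ssm_kernel(a, b, c, len(u))
    return [sum(kernel[j] * u[t - j] for j in range(t + 1)) for t in range(len(u))]


# Toy parameters: a 2-dimensional state and a short input signal (made-up values).
a = [0.9, 0.5]
b = [1.0, 0.5]
c = [0.3, -0.2]
u = [1.0, 0.0, 2.0, -1.0, 0.5]

y_rec = ssm_recurrent(a, b, c, u)
y_conv = ssm_convolutional(a, b, c, u)
print("max difference:", max(abs(x - y) for x, y in zip(y_rec, y_conv)))
```

The recurrent form does a constant amount of work per token, which is where the linear scaling comes from. The convolutional form allows the whole sequence to be processed in parallel during training.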