
A Deep Dive Into the Transformer Architecture – The Development of Transformer Models


In this article, take a look at the development of transformer models.
It may seem like a long time since the world of natural language processing (NLP) was transformed by the seminal “Attention is All You Need” paper by Vaswani et al., but in fact that was less than three years ago. The relative recency of the introduction of transformer architectures and the ubiquity with which they have upended language tasks speak to the rapid rate of progress in machine learning and artificial intelligence. There’s no better time than now to gain a deep understanding of the inner workings of transformer architectures, especially with transformer models making big inroads into diverse new applications like predicting chemical reactions and reinforcement learning.

Whether you’re an old hand or you’re paying attention to transformer-style architectures for the first time, this article should offer something for you. First, we’ll dive deep into the fundamental concepts used to build the original 2017 Transformer. Then we’ll touch on some of the developments implemented in subsequent transformer models. Where appropriate, we’ll point out some limitations and how modern models that inherit ideas from the original Transformer try to overcome various shortcomings or improve performance.

Transformers are the current state-of-the-art type of model for dealing with sequences. Perhaps the most prominent application of these models is in text processing tasks, and the most prominent of these is machine translation. In fact, transformers and their conceptual progeny have infiltrated just about every benchmark leaderboard in NLP, from question answering to grammar correction. In many ways, transformer architectures are undergoing a surge in development similar to what we saw with convolutional neural networks following the 2012 ImageNet competition, for better and for worse.

A Transformer represented as a black box: an entire sequence of inputs (the x’s in the diagram) is parsed simultaneously in a feed-forward manner, producing a transformed output tensor. In this diagram the output sequence is more concise than the input sequence; for practical NLP tasks, word order and sentence length may vary substantially.

Unlike previous state-of-the-art architectures for NLP, such as the many variants of RNNs and LSTMs, transformers have no recurrent connections and thus no real memory of previous states. Transformers get around this lack of memory by perceiving entire sequences simultaneously. Perhaps a transformer neural network perceives the world a bit like the aliens in the movie Arrival. Strictly speaking, future elements are usually masked out during training (see the short code sketch below), but other than that the model is free to learn long-term semantic dependencies throughout the entire sequence.

Transformers do away with recurrent connections and parse entire sequences simultaneously, sort of like the Heptapods in Arrival. You can make your own logograms using the open source Python 2 repository by FlxB2 (https://github.com/FlxB2/arrival_logograms).

Operating as feed-forward-only models, transformers require a slightly different approach to hardware. They are actually much better suited to modern machine learning accelerators, because unlike recurrent networks there is no sequential processing: the model doesn’t have to work through a string of elements in order to build up a useful hidden cell state.
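To make the masking idea concrete, here is a minimal sketch (assuming PyTorch; the random scores tensor is placeholder data, not something from the article) of how positions in the future can be hidden from the attention weights during training:

```python
import torch

# Placeholder attention scores for a toy sequence of 5 tokens:
# scores[i, j] says how strongly position i attends to position j.
seq_len = 5
scores = torch.randn(seq_len, seq_len)

# "The future" is everything above the diagonal (j > i). Setting those
# entries to -inf means softmax will give them exactly zero weight.
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = scores.masked_fill(future, float("-inf"))

weights = torch.softmax(masked_scores, dim=-1)
print(weights)  # row i only places weight on positions 0 through i
```

With the mask applied, each position can still attend to everything that came before it anywhere in the sequence, which is what lets the model pick up long-range dependencies without recurrence.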
Transformers can require a lot of memory during training, but running training or inference at reduced precision can help to alleviate those memory requirements.

Transfer learning is an important shortcut to state-of-the-art performance on a given text-based task, and quite frankly a necessity for most practitioners on realistic budgets. The energy and financial cost of training a large modern transformer from scratch can easily dwarf an individual researcher’s total yearly energy consumption, and can run to thousands of dollars of cloud compute. Luckily, as with deep learning for computer vision, a large pre-trained transformer (e.g. downloaded from the HuggingFace repository) can be fine-tuned to acquire the new skills needed for a specialized task.

The secret sauce in transformer architectures is the incorporation of some sort of attention mechanism, and the 2017 original is no exception. To avoid confusion, we’ll refer to the model demonstrated by Vaswani et al. as either just Transformer or the vanilla Transformer, to distinguish it from successors with similar names like Transformer-XL. We’ll start by looking at the attention mechanism and build outward to a high-level view of the entire model. Attention is a means of selectively weighting different elements in the input data so that they have an adjusted impact on the hidden states of downstream layers.
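To make that last definition concrete, here is a minimal sketch of scaled dot-product attention, the particular attention function used in the vanilla Transformer (assuming PyTorch; the tensor sizes and placeholder data below are illustrative, not from the article):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # how relevant each key is to each query
    weights = F.softmax(scores, dim=-1)            # per-query weights over all positions, summing to 1
    return weights @ v                             # weighted sum of the value vectors

# Toy self-attention: queries, keys, and values all come from the same
# 4-token sequence with 8-dimensional embeddings (random placeholder data).
x = torch.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([4, 8])
```

The division by sqrt(d_k) keeps the dot products from growing with the embedding dimension, which would otherwise push the softmax into regions with very small gradients.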
