AI Expert Retro Talk Part V: Attention is All You Need đź‘€
A deep dive into how ChatGPT and other large language models (LLMs) work under the hood
1. Introduction to Attention
Attention is a concept in deep learning that is inspired by how humans pay visual attention to different regions of an image or correlate words in a sentence. It allows us to focus on certain areas with “high resolution” while perceiving the surroundings in “low resolution.” In AI, attention can be broadly interpreted as a vector of importance weights, helping to predict or infer one element by estimating how strongly it is correlated with other elements.
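As a rough, purely illustrative sketch (toy numbers, not anything from a specific model), the "vector of importance weights" can be computed by scoring every element against a query and normalizing the scores with a softmax:

```python
import numpy as np

def importance_weights(query, elements):
    """Toy illustration: score each element against the query with a dot
    product, then softmax-normalize the scores into importance weights."""
    scores = elements @ query                       # one score per element
    scores = scores - scores.max()                  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()                  # weights sum to 1.0

rng = np.random.default_rng(0)
query = rng.normal(size=4)                          # the element we want to predict
elements = rng.normal(size=(5, 4))                  # 5 candidate elements to attend to
print(importance_weights(query, elements))
```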
2. The Seq2Seq Model and Its Limitations
The seq2seq model, born in the field of language modeling, aims to transform an input sequence into a new one. It consists of an encoder that processes the input sequence and a decoder that emits the transformed output. A critical disadvantage of this design, however, is its inability to remember long sentences: it often forgets the earlier parts by the time it has processed the whole input.
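A minimal sketch of the bottleneck (assuming a plain RNN encoder with made-up toy dimensions, not the exact model from the talk): however long the input is, the decoder only ever sees one fixed-size context vector.

```python
import numpy as np

def rnn_encoder(inputs, hidden_size=8):
    """Toy RNN encoder: fold the whole source sequence into ONE hidden
    state. Everything the decoder learns about the source is this vector,
    so early tokens are easily overwritten in long inputs."""
    rng = np.random.default_rng(1)
    W_x = rng.normal(scale=0.1, size=(hidden_size, inputs.shape[1]))
    W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    h = np.zeros(hidden_size)
    for x in inputs:                        # step through the source tokens
        h = np.tanh(W_x @ x + W_h @ h)      # old information keeps getting mixed over
    return h                                # fixed-size context vector

source = np.random.default_rng(2).normal(size=(50, 16))   # a "long" sentence
print(rnn_encoder(source).shape)                           # (8,) regardless of length
```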
3. Birth of Attention Mechanism for Translation
To overcome the limitations of the seq2seq model, the attention mechanism was introduced, originally to help memorize long source sentences in neural machine translation (NMT). Rather than compressing the source into a single fixed context vector, attention creates shortcuts between the context vector and the entire source input, allowing for better alignment and memory retention.
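A hedged sketch of the idea, loosely following the additive (Bahdanau-style) formulation with toy shapes and randomly initialized weights: at every decoding step, all encoder states are re-scored against the current decoder state and blended into a fresh context vector, so nothing has to survive a single bottleneck.

```python
import numpy as np

def attention_context(decoder_state, encoder_states, rng=None):
    """Additive-attention sketch: score every encoder state against the
    current decoder state, softmax the scores into alignment weights, and
    return the weighted sum of encoder states as this step's context."""
    rng = rng or np.random.default_rng(3)
    d_att = 10                                            # toy attention size
    W_s = rng.normal(scale=0.1, size=(d_att, decoder_state.size))
    W_h = rng.normal(scale=0.1, size=(d_att, encoder_states.shape[1]))
    v = rng.normal(scale=0.1, size=d_att)
    scores = np.tanh(decoder_state @ W_s.T + encoder_states @ W_h.T) @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ encoder_states                       # shortcut to ALL source states

enc = np.random.default_rng(4).normal(size=(50, 8))       # 50 source positions
dec = np.random.default_rng(5).normal(size=8)             # current decoder state
print(attention_context(dec, enc).shape)                  # (8,)
```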
4. Types and Definitions of Attention Mechanisms
Several popular attention mechanisms have been developed, each with its own alignment score function (the usual forms are written out in the equations after this list):
- Content-based Attention: Scores source states by their content (cosine) similarity to the target state.
- Additive Attention: Uses a small feed-forward network to compute the alignment score.
- Location-based Attention: Computes the alignment from the target position alone.
- General Attention: Utilizes a trainable weight matrix between the target and source states.
- Dot-product Attention: Scores alignment with the dot product of the target and source states.
- Scaled Dot-product Attention: The same dot product, divided by the square root of the hidden-state dimension so that scores do not blow up as the dimension grows.
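For reference, these are the standard score forms as usually written (using s_t for the target/decoder state, h_i for a source state, trainable parameters W_a and v_a, and n for the hidden-state dimension); they come from the original papers rather than from anything specific to this talk:

```latex
\begin{aligned}
\text{Content-based:} \quad & \mathrm{score}(s_t, h_i) = \cos(s_t, h_i) \\
\text{Additive:} \quad & \mathrm{score}(s_t, h_i) = v_a^\top \tanh\big(W_a [s_t ; h_i]\big) \\
\text{Location-based:} \quad & \alpha_{t,i} = \mathrm{softmax}(W_a s_t) \\
\text{General:} \quad & \mathrm{score}(s_t, h_i) = s_t^\top W_a h_i \\
\text{Dot-product:} \quad & \mathrm{score}(s_t, h_i) = s_t^\top h_i \\
\text{Scaled dot-product:} \quad & \mathrm{score}(s_t, h_i) = \frac{s_t^\top h_i}{\sqrt{n}}
\end{aligned}
```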
5. Self-Attention and Its Applications
Self-attention, also known as intra-attention, relates different positions of a single sequence to compute a representation of that sequence. It has proven very useful in machine reading, abstractive summarization, and image description generation, allowing the model to capture correlations between different parts of the same content.
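A minimal numpy sketch (toy dimensions and random weights, not tied to any particular paper's setup): the defining property is that queries, keys, and values are all projections of the same sequence, so every position builds its new representation out of every other position.

```python
import numpy as np

def self_attention(X, d_k=16, rng=None):
    """Toy self-attention: Q, K, V are projections of the SAME sequence X,
    so each position attends to (and mixes in) every position of X."""
    rng = rng or np.random.default_rng(0)
    d_model = X.shape[1]
    W_q = rng.normal(scale=0.1, size=(d_model, d_k))
    W_k = rng.normal(scale=0.1, size=(d_model, d_k))
    W_v = rng.normal(scale=0.1, size=(d_model, d_k))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # one new vector per position

X = np.random.default_rng(1).normal(size=(10, 32))  # 10 tokens, 32-dim each
print(self_attention(X).shape)                       # (10, 16)
```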
6. Soft vs Hard Attention
Soft attention learns alignment weights over all patches in the source image, making it smooth and differentiable but computationally expensive. Hard attention, on the other hand, selects one patch at a time, making it less computationally intensive but non-differentiable. These two types offer different trade-offs in terms of performance and computational efficiency.
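To make the trade-off concrete, here is a toy contrast (illustrative only; in practice hard attention is trained by sampling a patch with variance-reduction or reinforcement-learning tricks, not by taking an argmax):

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(9, 4))            # 9 image patches, 4-dim features each
scores = rng.normal(size=9)                  # alignment scores for the patches
weights = np.exp(scores - scores.max())
weights /= weights.sum()                     # softmax over ALL patches

soft_context = weights @ patches             # smooth weighted average: differentiable
hard_context = patches[np.argmax(weights)]   # commit to ONE patch: not differentiable

print(soft_context.shape, hard_context.shape)   # (4,) (4,)
```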
7. Global vs Local Attention
Global attention is similar to soft attention, attending to the entire input state space. Local attention, an interesting blend between hard and soft, predicts a single aligned position for the current target word and uses a window centered around the source position to compute a context vector. This allows for a more focused and differentiable attention mechanism.
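A rough sketch of the windowed idea, loosely inspired by Luong-style local attention with toy numbers (and simplified: the Gaussian term here stands in for the content-based scores it would normally be multiplied with): predict a source position for the current target step, then build the context only from a window around it.

```python
import numpy as np

def local_attention_context(encoder_states, p_t, D=4):
    """Local-attention sketch: keep only the window [p_t - D, p_t + D]
    around the predicted source position p_t, weight it with a Gaussian
    centered on p_t, and return the weighted sum as the context vector."""
    S = len(encoder_states)
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    positions = np.arange(lo, hi)
    sigma = D / 2.0
    weights = np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
    weights /= weights.sum()
    return weights @ encoder_states[lo:hi]       # context from the window only

enc = np.random.default_rng(0).normal(size=(50, 8))   # 50 source positions
print(local_attention_context(enc, p_t=20.3).shape)   # (8,)
```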
8. Neural Turing Machines (NTM)
The Neural Turing Machine (NTM) is a model architecture that couples a neural network with external memory storage, mimicking the Turing machine's computational model. It consists of a controller neural network and a memory bank that the controller reads from and writes to, allowing for more complex, open-ended computations than the network's fixed internal weights alone could support.
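A hedged sketch of just the reading side (content-based addressing only; a full NTM also has location-based addressing and write heads): the controller emits a key, the key is compared against every memory row, and the read vector is the resulting attention-weighted sum.

```python
import numpy as np

def ntm_content_read(memory, key, beta=2.0):
    """NTM-style content addressing sketch: cosine-compare the key with
    every memory row, sharpen with beta, softmax into read weights, and
    return the weighted sum of memory rows as the read vector."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    scores = beta * sims                          # beta sharpens the focus
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory                       # what the controller "reads"

memory = np.random.default_rng(0).normal(size=(128, 20))   # 128 slots, 20-dim each
key = np.random.default_rng(1).normal(size=20)              # emitted by the controller
print(ntm_content_read(memory, key).shape)                   # (20,)
```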
9. Pointer Network
The Pointer Network addresses problems whose output elements correspond to positions in the input sequence. It applies attention over the input elements to pick one as the output at each decoder step, so outputs are expressed directly as positions in the input.
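A minimal sketch of the pointing step (toy dimensions and random weights; a real pointer network also feeds previously selected inputs back into the decoder): the attention distribution over input elements is used directly as the output distribution, so the "prediction" is just an input position.

```python
import numpy as np

def pointer_step(decoder_state, encoder_states, rng=None):
    """Pointer-network sketch: instead of blending encoder states into a
    context vector, the attention distribution over input positions IS the
    output distribution, and we point at one input element."""
    rng = rng or np.random.default_rng(0)
    d_att = 10
    W_d = rng.normal(scale=0.1, size=(d_att, decoder_state.size))
    W_e = rng.normal(scale=0.1, size=(d_att, encoder_states.shape[1]))
    v = rng.normal(scale=0.1, size=d_att)
    scores = np.tanh(decoder_state @ W_d.T + encoder_states @ W_e.T) @ v
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs           # chosen input index + distribution

enc = np.random.default_rng(1).normal(size=(6, 8))   # 6 input elements
dec = np.random.default_rng(2).normal(size=8)        # current decoder state
idx, probs = pointer_step(dec, enc)
print(idx, probs.round(2))
```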
10. Transformer Model
The transformer model is built entirely on self-attention mechanisms, without any sequence-aligned recurrent architecture. It includes components such as multi-head self-attention, an encoder, and a decoder, allowing for more efficient and effective sequence-to-sequence modeling.
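To ground the "multi-head" part, here is a toy numpy sketch (one layer, no masking, and none of the residual connections, layer normalization, feed-forward sublayers, or positional encodings a real Transformer block also needs): several scaled dot-product attentions run in parallel on different projections, and their outputs are concatenated and projected back.

```python
import numpy as np

def multi_head_self_attention(X, num_heads=4, d_k=8, rng=None):
    """Multi-head self-attention sketch: each head gets its own Q/K/V
    projections, runs scaled dot-product attention over the sequence, and
    the per-head outputs are concatenated and projected back to d_model."""
    rng = rng or np.random.default_rng(0)
    d_model = X.shape[1]
    heads = []
    for _ in range(num_heads):
        W_q = rng.normal(scale=0.1, size=(d_model, d_k))
        W_k = rng.normal(scale=0.1, size=(d_model, d_k))
        W_v = rng.normal(scale=0.1, size=(d_model, d_k))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(d_k)                  # scaled dot-product
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        heads.append(weights @ V)
    W_o = rng.normal(scale=0.1, size=(num_heads * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ W_o          # back to (seq_len, d_model)

X = np.random.default_rng(1).normal(size=(12, 32))       # 12 tokens, 32-dim each
print(multi_head_self_attention(X).shape)                 # (12, 32)
```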
11. Other Developments
- SNAIL (Simple Neural Attentive Meta-Learner): Combines self-attention with temporal convolutions, demonstrating strong performance across meta-learning tasks.
- Self-Attention GAN: Adds self-attention layers into a GAN so that the model can better capture relationships between distant spatial regions, modeling global dependencies.
Conclusion
Attention plays a crucial and multifaceted role in LLMs and AI more broadly. From enhancing prediction and inference to remembering long sequences and capturing relationships between different parts of the data, attention mechanisms have revolutionized various domains. Their versatility and effectiveness have led to widespread adoption and continuous exploration, making attention a cornerstone of modern AI and deep learning. Whether in language translation, image recognition, or complex computations, attention provides a nuanced and powerful tool for understanding and processing information.