Foundations of Large Language Models: Under-the-hood of the Transformer • Talk @ SDSU • Nov 12, 2024
- Published Nov 24, 2024
- "Foundations of Large Language Models: Under-the-hood of the Transformer Architecture" • Invited Talk at San Diego State University (@SDSU) • November 12, 2024
• Relevant Primers:
transformer.ama...
llm.aman.ai
• Overview: The talk covered the foundational principles of Large Language Models (LLMs), focusing on the Transformer architecture and its key components: embeddings, positional encoding, self- and cross-attention, skip connections, token sampling, and the encoder and decoder stacks. It explained how these innovations enable efficient, context-aware language processing.
• Agenda:
➜ Transformer Overview:
Scaled dot-product attention and multi-head mechanisms for parallel processing and contextual understanding.
Handles long-range dependencies and enables parallel computation for efficient training.
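As a rough illustration of the scaled dot-product attention described above (all shapes and data below are made up, and multi-head splitting and masking are omitted), the core computation is only a few lines of NumPy:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # similarity of every query to every key, scaled by sqrt(d_k) to keep the softmax stable
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over the key axis turns scores into attention weights per query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted mix of the value vectors

Q = K = V = np.random.randn(4, 8)            # toy example: 4 tokens, 8-dim head
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)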
➜ Input Embeddings
Embeddings reduce the dimensionality of the input, projecting each token from a sparse, vocabulary-sized representation into a dense, lower-dimensional space where similar words lie closer together.
Enables generalization across words with similar meanings, significantly reducing the model's parameters and required training data.
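A minimal sketch of the lookup behind input embeddings (the vocabulary size, embedding dimension, and token ids below are illustrative, not from the talk):

import numpy as np

vocab_size, d_model = 10_000, 64                                # illustrative sizes
embedding_table = np.random.randn(vocab_size, d_model) * 0.02   # learned during training in practice

token_ids = np.array([12, 407, 9021])     # hypothetical token ids from a tokenizer
vectors = embedding_table[token_ids]      # (3, 64): one dense vector per input token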
➜ Positional Encoding
Absolute positional encoding uses sinusoidal functions to encode positions, enabling models to infer token order.
Rotary Positional Embeddings (RoPE) combine absolute and relative positional benefits for long-sequence handling.
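The sinusoidal (absolute) scheme from the original Transformer paper can be sketched as follows; the sequence length and model dimension here are arbitrary:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2) frequency indices
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=16, d_model=64)   # added to the token embeddings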
➜ Self-Attention
Maps query, key, and value vectors derived from the same sequence to calculate token relationships.
Enables dynamic weighting of token relevance, creating contextualized embeddings in parallel.
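A small sketch showing that self-attention derives queries, keys, and values from the same sequence via learned projections (random weights stand in for trained ones):

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # all three come from the same sequence X
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # contextualized token embeddings

d = 64
X = np.random.randn(6, d)                             # 6 tokens
W_q, W_k, W_v = (np.random.randn(d, d) * 0.1 for _ in range(3))
ctx = self_attention(X, W_q, W_k, W_v)                # shape (6, 64)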
➜ Cross Attention
Bridges encoder and decoder stacks by using encoder outputs as keys and values, with decoder queries steering generation.
Essential for tasks like translation, where the target sequence depends on the source sequence.
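Cross-attention looks almost identical in a sketch, except that queries come from the decoder while keys and values come from the encoder outputs (all weights and lengths below are illustrative):

import numpy as np

def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    Q = decoder_states @ W_q                          # queries: target side
    K = encoder_states @ W_k                          # keys: source side
    V = encoder_states @ W_v                          # values: source side
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # source-conditioned decoder states

d = 64
enc = np.random.randn(10, d)                          # e.g. encoded source sentence, 10 tokens
dec = np.random.randn(4, d)                           # target prefix generated so far, 4 tokens
W_q, W_k, W_v = (np.random.randn(d, d) * 0.1 for _ in range(3))
out = cross_attention(dec, enc, W_q, W_k, W_v)        # shape (4, 64)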
➜ Skip/Residual Connections
Prevent vanishing gradients and preserve the original input signal by adding a layer's input back into its output.
Improves gradient flow, avoids forgetting input tokens, and enhances training stability.
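A residual (skip) connection is simply the input added back onto a sublayer's output; the sublayer below is a stand-in for attention or a feed-forward block:

import numpy as np

d = 64
W = np.random.randn(d, d) * 0.1         # placeholder weights for some sublayer

def sublayer(x):
    return np.tanh(x @ W)               # stand-in for attention or feed-forward

def residual_block(x):
    return x + sublayer(x)              # skip connection: the identity path preserves the input

x = np.random.randn(4, d)
y = residual_block(x)                   # gradients can always flow through the "+ x" path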
➜ Token Sampling
Converts the dense, de-embedded outputs (vocabulary logits) into a probability distribution via softmax for next-token prediction.
Techniques like temperature scaling or top-k sampling refine the generation diversity and quality.
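A sketch of temperature scaling combined with top-k sampling over the final logits (the vocabulary size and hyperparameters are illustrative):

import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=50):
    logits = logits / temperature               # <1 sharpens, >1 flattens the distribution
    top = np.argsort(logits)[-top_k:]           # keep only the k most likely tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                        # softmax over the shortlist
    return np.random.choice(top, p=probs)

logits = np.random.randn(32_000)                # hypothetical 32k-token vocabulary
next_id = sample_next_token(logits, temperature=0.8, top_k=40)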
➜ Encoder
Stacks of self-attention and feed-forward layers encode the input sequence into contextual representations, one fixed-dimensional vector per token.
Processes input tokens bidirectionally to capture complete contextual relationships.
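A stripped-down encoder layer in the same toy setup (layer norm and the learned Q/K/V projections are omitted for brevity; weights are random placeholders):

import numpy as np

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def encoder_layer(X, W1, W2):
    X = X + attention(X, X, X)                  # bidirectional self-attention + residual
    return X + np.maximum(X @ W1, 0) @ W2       # position-wise feed-forward (ReLU) + residual

d = 64
X = np.random.randn(8, d)                       # 8 input tokens
out = encoder_layer(X, np.random.randn(d, 4 * d) * 0.05, np.random.randn(4 * d, d) * 0.05)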
➜ Decoder
Uses causal self-attention to ensure autoregressive token generation.
Integrates cross-attention to incorporate encoder outputs and generate coherent outputs token-by-token.
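Finally, a sketch of the causal (masked) self-attention that makes decoding autoregressive: positions above the diagonal are blocked so a token can only attend to itself and earlier tokens (Q/K/V projections again omitted):

import numpy as np

def causal_self_attention(X):
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    scores[mask] = -np.inf                             # block attention to future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

X = np.random.randn(5, 64)
out = causal_self_attention(X)                         # token i only "sees" tokens 0..i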
• Relevant Papers:
➜ Transformer Overview
Attention Is All You Need: arxiv.org/abs/...
➜ Input Embeddings
Efficient Estimation of Word Representations in Vector Space (Word2Vec): arxiv.org/abs/...
GloVe: Global Vectors for Word Representation: aclanthology.o...
fastText: Enriching Word Vectors with Subword Information: arxiv.org/abs/...
➜ Positional Encoding
Attention Is All You Need (Original Sinusoidal Positional Encoding): arxiv.org/abs/...
Self-Attention with Relative Position Representations: arxiv.org/abs/...
RoFormer: Enhanced Transformer with Rotary Position Embedding: arxiv.org/abs/...
➜ Self-Attention
Attention Is All You Need: arxiv.org/abs/...
Neural Machine Translation by Jointly Learning to Align and Translate (Additive Attention): arxiv.org/abs/...
➜ Cross Attention
Attention Is All You Need (Encoder-Decoder Attention): arxiv.org/abs/...
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation: aclanthology.o...
➜ Skip/Residual Connections
Deep Residual Learning for Image Recognition (ResNet): arxiv.org/abs/...
Attention Is All You Need (Skip Connections in Transformers): arxiv.org/abs/...
➜ Token Sampling
Categorical Reparameterization with Gumbel-Softmax (Sampling methods): arxiv.org/abs/...
Decoding Strategies for Neural Machine Translation: aclanthology.o...
➜ Encoder
Attention Is All You Need (Encoder Architecture): arxiv.org/abs/...
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: arxiv.org/abs/...
➜ Decoder
Attention Is All You Need (Decoder Architecture): arxiv.org/abs/...
Language Models are Few-Shot Learners (Autoregressive Decoding in GPT-3): arxiv.org/abs/...