Foundations of Large Language Models: Under-the-hood of the Transformer • Talk @ SDSU • Nov 12, 2024

  • Published Nov 24, 2024
  • "Foundations of Large Language Models: Under-the-hood of the Transformer Architecture" • Invited Talk at San Diego State University (‪@SDSU‬) • November 12, 2024
    • Relevant Primers:
    transformer.ama...
    llm.aman.ai
    • Overview: The talk covered the foundational principles of Large Language Models (LLMs), focusing on the Transformer architecture and its key components: embeddings, positional encoding, self- and cross-attention, skip connections, token sampling, and the encoder and decoder stacks. It explained how these pieces combine to enable efficient, context-aware language processing.
    • Agenda:
    ➜ Transformer Overview:
    Scaled dot-product attention and multi-head mechanisms for parallel processing and contextual understanding.
    Handles long-range dependencies and enables parallel computation for efficient training.
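    A rough, single-head NumPy sketch of the scaled dot-product attention mentioned above (illustrative only, not code from the talk; the function name and shapes are assumptions, and the multi-head split is omitted):
      import numpy as np

      def scaled_dot_product_attention(Q, K, V):
          # Q, K, V: (seq_len, d_k) matrices; scores are scaled by sqrt(d_k)
          scores = Q @ K.T / np.sqrt(Q.shape[-1])              # (seq_len, seq_len) similarities
          weights = np.exp(scores - scores.max(-1, keepdims=True))
          weights /= weights.sum(-1, keepdims=True)            # row-wise softmax
          return weights @ V                                   # weighted sum of value vectors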
    ➜ Input Embeddings
    Embeddings project sparse one-hot token vectors into a lower-dimensional dense space where semantically similar words lie close together.
    Enables generalization across words with similar meanings, significantly reducing the model's parameters and required training data.
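    A minimal sketch of the embedding lookup; the vocabulary size, dimension, and token ids below are made up, and the table would be a learned parameter rather than random:
      import numpy as np

      vocab_size, d_model = 10_000, 512                        # illustrative sizes
      embedding_table = np.random.randn(vocab_size, d_model) * 0.02   # learned in a real model

      token_ids = np.array([42, 7, 3051])                      # hypothetical ids for a 3-token input
      embedded = embedding_table[token_ids]                    # (3, d_model) dense vectors instead of one-hots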
    ➜ Positional Encoding
    Absolute positional encoding uses sinusoidal functions to encode positions, enabling models to infer token order.
    Rotary Positional Embeddings (RoPE) combine absolute and relative positional benefits for long-sequence handling.
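    A sketch of the absolute sinusoidal encoding from "Attention Is All You Need" (RoPE is not shown); the function name is an assumption and an even d_model is assumed:
      import numpy as np

      def sinusoidal_positional_encoding(seq_len, d_model):
          # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
          # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
          positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
          dims = np.arange(0, d_model, 2)[None, :]             # even feature indices 2i
          angles = positions / np.power(10000.0, dims / d_model)
          pe = np.zeros((seq_len, d_model))
          pe[:, 0::2] = np.sin(angles)
          pe[:, 1::2] = np.cos(angles)
          return pe                                            # added to the token embeddings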
    ➜ Self-Attention
    Maps query, key, and value vectors derived from the same sequence to calculate token relationships.
    Enables dynamic weighting of token relevance, creating contextualized embeddings in parallel.
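    A self-contained sketch of self-attention (sizes and weights are illustrative): queries, keys, and values are all projections of the same sequence, yielding one contextualized vector per token:
      import numpy as np

      def attention(Q, K, V):
          # Same scaled dot-product attention as in the overview sketch above.
          scores = Q @ K.T / np.sqrt(Q.shape[-1])
          weights = np.exp(scores - scores.max(-1, keepdims=True))
          weights /= weights.sum(-1, keepdims=True)
          return weights @ V

      seq_len, d_model = 6, 512
      x = np.random.randn(seq_len, d_model)                    # embeddings of ONE sequence
      W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.02 for _ in range(3))

      # Q, K, V all come from the same x, so every token weighs every other token in parallel.
      contextual = attention(x @ W_q, x @ W_k, x @ W_v)        # (seq_len, d_model)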
    ➜ Cross Attention
    Bridges encoder and decoder stacks by using encoder outputs as keys and values, with decoder queries steering generation.
    Essential for tasks like translation, where the target sequence depends on the source sequence.
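    A sketch of cross-attention with made-up shapes: decoder states supply the queries while encoder outputs supply the keys and values, so the source sequence steers generation:
      import numpy as np

      def attention(Q, K, V):
          scores = Q @ K.T / np.sqrt(Q.shape[-1])
          weights = np.exp(scores - scores.max(-1, keepdims=True))
          weights /= weights.sum(-1, keepdims=True)
          return weights @ V

      d_model = 512
      encoder_out = np.random.randn(8, d_model)                # encoder output for 8 source tokens
      decoder_x   = np.random.randn(4, d_model)                # decoder states for 4 target tokens
      W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.02 for _ in range(3))

      attended = attention(decoder_x @ W_q,                    # queries from the decoder
                           encoder_out @ W_k,                  # keys from the encoder ...
                           encoder_out @ W_v)                  # ... and values from the encoder
      # attended: (4, d_model) -- one source-conditioned vector per target position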
    ➜ Skip/Residual Connections
    Prevent vanishing gradients and preserve the original input signal by adding each layer's input back to its output.
    Improves gradient flow, avoids forgetting input tokens, and enhances training stability.
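    A one-line illustration of the residual idea, using a stand-in feed-forward sub-layer with random placeholder weights:
      import numpy as np

      def feed_forward(x):
          # Stand-in for any sub-layer (attention or MLP); weights would be learned in practice.
          W1 = np.random.randn(x.shape[-1], 4 * x.shape[-1]) * 0.02
          W2 = np.random.randn(4 * x.shape[-1], x.shape[-1]) * 0.02
          return np.maximum(0, x @ W1) @ W2                    # ReLU MLP

      x = np.random.randn(6, 512)
      y = x + feed_forward(x)    # skip connection: the sub-layer learns a residual, and gradients
                                 # flow through the identity path, so the original input is never lost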
    ➜ Token Sampling
    Converts the de-embedded outputs (logits from the final linear layer) into a probability distribution over the vocabulary using softmax for next-token prediction.
    Techniques like temperature scaling and top-k sampling control the diversity and quality of the generated text.
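    A sketch of temperature scaling plus top-k sampling over a logit vector; the function name and default values are illustrative, not from the talk:
      import numpy as np

      def sample_next_token(logits, temperature=1.0, top_k=50):
          # logits: unnormalized scores over the vocabulary from the final linear layer
          logits = logits / temperature                        # <1 sharpens, >1 flattens the distribution
          top = np.argsort(logits)[-top_k:]                    # keep only the k most likely token ids
          probs = np.exp(logits[top] - logits[top].max())
          probs /= probs.sum()                                 # softmax over the surviving candidates
          return np.random.choice(top, p=probs)                # sample one token id

      next_id = sample_next_token(np.random.randn(10_000), temperature=0.8, top_k=40)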
    ➜ Encoder
    Stacks of self-attention and feed-forward layers encode the input sequence into contextual representations, one per token.
    Processes input tokens bidirectionally to capture complete contextual relationships.
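    A simplified post-norm encoder layer, assuming a toy un-projected self-attention and omitting the learned layer-norm gain/bias, purely to show the residual + norm wiring:
      import numpy as np

      def layer_norm(x, eps=1e-5):
          return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

      def self_attention(x):
          # Un-projected attention for brevity; real layers use learned Q/K/V projections.
          scores = x @ x.T / np.sqrt(x.shape[-1])
          weights = np.exp(scores - scores.max(-1, keepdims=True))
          weights /= weights.sum(-1, keepdims=True)
          return weights @ x

      def encoder_layer(x, W1, W2):
          x = layer_norm(x + self_attention(x))                # bidirectional: every token sees every other
          x = layer_norm(x + np.maximum(0, x @ W1) @ W2)       # position-wise feed-forward
          return x

      d = 512
      out = encoder_layer(np.random.randn(6, d),
                          np.random.randn(d, 4 * d) * 0.02,
                          np.random.randn(4 * d, d) * 0.02)    # (6, d): one contextual vector per token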
    ➜ Decoder
    Uses causal self-attention to ensure autoregressive token generation.
    Integrates cross-attention to incorporate encoder outputs and generate coherent outputs token-by-token.
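    A sketch of the causal (masked) self-attention that makes decoding autoregressive; cross-attention is omitted and shapes are illustrative:
      import numpy as np

      def causal_self_attention(x):
          # Mask future positions so token i can only attend to tokens 0..i.
          n = x.shape[0]
          scores = x @ x.T / np.sqrt(x.shape[-1])
          future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = "future"
          scores = np.where(future, -np.inf, scores)
          weights = np.exp(scores - scores.max(-1, keepdims=True))
          weights /= weights.sum(-1, keepdims=True)
          return weights @ x

      out = causal_self_attention(np.random.randn(5, 512))     # each row depends only on earlier tokens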
    • Relevant Papers:
    ➜ Transformer Overview
    Attention Is All You Need: arxiv.org/abs/...
    ➜ Input Embeddings
    Efficient Estimation of Word Representations in Vector Space (Word2Vec): arxiv.org/abs/...
    GloVe: Global Vectors for Word Representation: aclanthology.o...
    fastText: Enriching Word Vectors with Subword Information: arxiv.org/abs/...
    ➜ Positional Encoding
    Attention Is All You Need (Original Sinusoidal Positional Encoding): arxiv.org/abs/...
    Self-Attention with Relative Position Representations: arxiv.org/abs/...
    RoFormer: Enhanced Transformer with Rotary Position Embedding: arxiv.org/abs/...
    ➜ Self-Attention
    Attention Is All You Need: arxiv.org/abs/...
    Neural Machine Translation by Jointly Learning to Align and Translate (Additive Attention): arxiv.org/abs/...
    ➜ Cross Attention
    Attention Is All You Need (Encoder-Decoder Attention): arxiv.org/abs/...
    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation: aclanthology.o...
    ➜ Skip/Residual Connections
    Deep Residual Learning for Image Recognition (ResNet): arxiv.org/abs/...
    Attention Is All You Need (Skip Connections in Transformers): arxiv.org/abs/...
    ➜ Token Sampling
    Categorical Reparameterization with Gumbel-Softmax (Sampling methods): arxiv.org/abs/...
    Decoding Strategies for Neural Machine Translation: aclanthology.o...
    ➜ Encoder
    Attention Is All You Need (Encoder Architecture): arxiv.org/abs/...
    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: arxiv.org/abs/...
    ➜ Decoder
    Attention Is All You Need (Decoder Architecture): arxiv.org/abs/...
    Language Models are Few-Shot Learners (Autoregressive Decoding in GPT-3): arxiv.org/abs/...
