MiniMax-01: Scaling Foundation Models with Lightning Attention - 4M tokens context window

  • Published 17 Jan 2025
  • arxiv: arxiv.org/pdf/...
    HF: huggingface.co...
    GitHub: github.com/Min...
    The document details MiniMax-01, a new series of large language models (LLMs) and vision-language models (VLMs) that achieve state-of-the-art performance while significantly extending the context window length. This is accomplished through "lightning attention," a highly efficient linear attention mechanism, combined with Mixture of Experts (MoE). The models, MiniMax-Text-01 and MiniMax-VL-01, are open-sourced and report strong results across a broad set of benchmarks, particularly in long-context processing, multimodal understanding, and knowledge-based reasoning.
    MiniMax-01's scaling is made possible by several architectural innovations, allowing it to handle context windows of up to 4 million tokens. Here are the key innovations:
    *Lightning Attention:* This I/O-aware, optimized implementation of TransNormer addresses the computational bottleneck of linear attention mechanisms. It uses a novel tiling technique to divide attention calculation into intra-block and inter-block computations. This allows the model to scale linearly with the input sequence length.
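    To make the tiling idea concrete, here is a minimal PyTorch sketch of block-wise causal linear attention in the spirit of lightning attention: each tile combines an exact intra-block term with an inter-block term carried in a running k-transpose-v state, so cost grows linearly with sequence length. The feature map, normalization, decay factors, and the I/O-aware kernel scheduling of the real implementation are omitted; the function name and `block_size` are illustrative.

```python
import torch

def tiled_linear_attention(q, k, v, block_size=64):
    """q, k, v: (seq_len, d). Causal linear attention computed tile by tile."""
    seq_len, d = q.shape
    kv_state = torch.zeros(d, d, dtype=q.dtype)   # running sum of k_t^T v_t from past tiles
    out = torch.zeros_like(v)
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        qb, kb, vb = q[start:end], k[start:end], v[start:end]
        inter = qb @ kv_state                      # inter-block: all previous tiles via the state
        intra = torch.tril(qb @ kb.T) @ vb         # intra-block: exact causal attention inside the tile
        out[start:end] = inter + intra
        kv_state = kv_state + kb.T @ vb            # fold this tile's keys/values into the state
    return out
```

    Because only the small `kv_state` matrix is carried between tiles, the per-block memory traffic stays constant, which is the property an I/O-aware kernel can exploit.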
    *Hybrid Architecture:* MiniMax-01 uses a hybrid architecture that combines lightning attention with softmax attention and Mixture of Experts (MoE). This allows the model to balance the efficiency of linear attention with the retrieval capabilities of softmax attention. The architecture uses one transformer block with softmax attention for every seven transformer blocks with lightning attention.
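    A minimal sketch of how such a 7:1 interleaving could be expressed, using placeholder block classes rather than MiniMax-01's actual modules:

```python
import torch.nn as nn

class LightningBlock(nn.Module):
    """Placeholder for a transformer block with lightning (linear) attention."""
    def __init__(self, d_model):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model)     # stand-in for the linear-attention mixer
    def forward(self, x):
        return x + self.mix(x)

class SoftmaxBlock(nn.Module):
    """Placeholder for a transformer block with standard softmax attention."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        return x + attn_out

def build_hybrid_stack(n_layers, d_model):
    # every eighth block uses softmax attention, i.e. one per seven lightning blocks
    return nn.ModuleList([
        SoftmaxBlock(d_model) if (i + 1) % 8 == 0 else LightningBlock(d_model)
        for i in range(n_layers)
    ])
```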
    *Mixture of Experts (MoE):* The model incorporates MoE to enhance scalability and efficiency. This enables the model to have a large number of parameters while only activating a subset for each token. MiniMax-01 has 32 experts and 456 billion total parameters, but only 45.9 billion are activated for each token.
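    The following sketch illustrates the general idea of token-level top-k expert routing that makes this possible; the expert sizes, number of active experts per token, and gating details here are illustrative assumptions, not MiniMax-01's exact configuration.

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=32, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # router scoring every expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                          # x: (n_tokens, d_model)
        scores = torch.softmax(self.gate(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out                                 # gate weights are not renormalized in this sketch
```

    Because each token runs only its top-k experts, the parameters activated per token are a small fraction of the total, which is how 456 billion total parameters reduce to 45.9 billion activated.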
    *Optimized Parallel Strategy:* The development of an optimized parallel strategy using techniques such as expert parallel (EP), expert tensor parallel (ETP), varlen ring attention, and Linear Attention Sequence Parallelism (LASP) enables efficient training and inference. This allows the model to handle long contexts on a single machine with limited resources.
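    A single-process sketch of the idea behind sequence parallelism for linear attention (LASP-style): the sequence is split into chunks, one per device in the real system, and only a small d×d key-value state has to be handed from one chunk to the next instead of full attention matrices. Communication collectives, varlen ring attention, and expert parallelism are beyond this sketch.

```python
import torch

def lasp_like_pass(q_chunks, k_chunks, v_chunks):
    """Each (q, k, v) chunk plays the role of one device's slice of the sequence."""
    d = q_chunks[0].shape[-1]
    kv_state = torch.zeros(d, d)                   # the only tensor "communicated" between chunks
    outputs = []
    for qb, kb, vb in zip(q_chunks, k_chunks, v_chunks):
        inter = qb @ kv_state                      # contribution from all earlier chunks
        intra = torch.tril(qb @ kb.T) @ vb         # causal attention within the local chunk
        outputs.append(inter + intra)
        kv_state = kv_state + kb.T @ vb            # state handed to the "next device"
    return torch.cat(outputs, dim=0)
```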
    *CUDA Kernel Optimizations:* MiniMax-01 employs a set of CUDA kernels specifically designed for lightning attention inference. These kernels achieve high model FLOPs utilization (MFU) on Nvidia H20 GPUs, improving the efficiency of the inference process.
    MiniMax-01, a series of large language and vision-language models, improves upon existing LLMs in several key ways:
    *1. Longer Context Window:* MiniMax-Text-01 can handle a context window of up to **1 million tokens during training and 4 million tokens during inference**, significantly exceeding the typical range of 32K to 256K tokens in most existing models. This expanded context window allows for applications like using professional books as context, assisting with entire programming projects, and maximizing the potential of in-context learning.
    *2. Linear Attention Implementation:* MiniMax-01 is the first successful large-scale implementation of linear attention.
    *3. Mixture of Experts (MoE):* To further enhance scalability and efficiency, MiniMax-01 integrates MoE with linear attention, creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token.
    *4. Computation Optimizations:* Extensive optimizations are implemented for both training and inference, ensuring efficient utilization of computational resources.
    *5. Data Quality and Training Strategy:* MiniMax-01 utilizes a rigorously curated pre-training corpus with data quality enhancement through filtering and reward-based evaluation. A three-stage training procedure is employed to extend the context window to one million tokens.
    *6. Multi-Stage Post-Training Framework:* A comprehensive post-training framework enhances the model’s performance, long-context capability, and real-world applicability.
    *7. Vision-Language Capabilities (MiniMax-VL-01):* MiniMax-VL-01 extends the language model's capabilities to visual understanding tasks through the integration of (see the projector sketch after this list):
    • A Vision Transformer (ViT) for visual encoding
    • A two-layer MLP projector for image adaptation
    • A dynamic resolution strategy that resizes input images according to a predefined grid
    • A four-stage training regimen involving visual pre-training and fine-tuning of the entire pipeline
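    A minimal sketch of how such a vision-language bridge could be wired up: ViT patch features are mapped into the language model's embedding space by a two-layer MLP projector, and the resulting image tokens are concatenated with text tokens before the decoder. The ViT itself, the dynamic-resolution grid, and the four-stage training schedule are not shown, and the dimensions are illustrative assumptions rather than MiniMax-VL-01's actual sizes.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP mapping ViT patch features into the language model's embedding space."""
    def __init__(self, vit_dim=1024, lm_dim=4096):   # dimensions are illustrative
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
    def forward(self, vit_features):               # vit_features: (n_patches, vit_dim)
        return self.proj(vit_features)             # image tokens consumed by the language model

# usage: project dummy ViT patch features, then concatenate with text embeddings before the decoder
image_tokens = VisionProjector()(torch.randn(256, 1024))
```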
    *8. Benchmark Performance:* MiniMax-01 achieves top-tier performance on various academic and in-house benchmarks, outperforming many existing LLMs, particularly in long-context processing.
    Created with NotebookLM
