Interviewing Tri Dao and Michael Poli of Together AI on the future of LLM architectures

  • Published 3 May 2024
  • The introduction to this post can be found here: www.interconnects.ai/p/llms-b...
    I've omitted the generative audio for this post due to the heavy reliance on plots and technical content. Please enjoy this interview as an entry point, and check out the post for more details.
    Michael Poli is a PhD student at Stanford and a researcher at Together AI. zymrael.github.io/
    Tri Dao is the Chief Scientist at Together AI and an incoming assistant professor at Princeton University. tridao.me/
    00:00 Introductions
    02:04 Why Attention works and may not scale
    06:15 Quadratic scaling in attention
    11:06 What is Striped Hyena
    16:30 What is Mamba
    21:00 Mamba hardware optimization
    24:15 Predictions for 2024 architectures
    29:00 More predictions for AI
    Thanks for listening!

COMMENTS • 7

  • @MaJetiGizzle 4 months ago +3

    Really tight interview dude! Just subscribed.

  • @-Evil-Genius- 4 months ago +2

    🎯 Key Takeaways for quick navigation:
    00:02 🎙️ *Introduction to the Interview*
    - Introduction to the first interview on Interconnects, aiming to bring scientific voices into the AI discourse.
    - Michael Poli and Tri Dao introduce themselves, mentioning their expertise and areas of research.
    02:27 🤖 *Why Attention Works in Transformers*
    - Discussion on why attention works in Transformers and its success factors.
    - The Transformer's ability to scale well with more parameters and data.
    - A general architecture that scales efficiently and is hardware-friendly (a minimal scaling sketch follows this comment).
    06:26 🔄 *Alternatives to Attention: Recurrent Neural Networks (RNNs)*
    - Overview of Recurrent Neural Networks (RNNs) and their sequential text processing.
    - Comparison of RNNs with Transformers and their limitations.
    - Mention of newer RNN architectures like RWKV and their performance.
    11:43 🚀 *Introduction to Striped Hyena (Attention Hybrid Model)*
    - Introduction to Striped Hyena as a hybrid model combining different layer categories.
    - Mention of compositional aspects and the use of various layer blocks.
    - Exploration of model grafting techniques for changing the architecture during training (a toy hybrid-stack sketch also follows this comment).
    17:30 🐍 *Introduction to Mamba (State Space Model)*
    - Introduction to Mamba as a collaboration focusing on state space models for language tasks.
    - Proof of concept to show that state space models can compete with or outperform Transformers.
    - Emphasis on the potential for faster inference and different approaches to context learning.
    21:32 💻 *Implementing New CUDA Kernels for State Space Models*
    - Discussion on implementing new CUDA kernels for state space models.
    - Explanation of the importance of expressive recurrent states and the challenges in memory handling.
    - Insights into the state size in state space models and how it contributes to model expressiveness.
    23:26 🚀 *State Space Models and Efficient GPU Memory Usage*
    - Focused on making things efficient on GPUs.
    - Proposed keeping the large recurrent state in faster on-chip memory (SRAM) rather than repeatedly writing it out to main GPU memory (HBM).
    - Addressed common constraints related to slow data movement, particularly with GPUs and TPUs.
    25:04 🌐 *Future Trends in 2024: Long Context Learning and Architecture Evolution*
    - Explored the limitations of mixture-of-experts models on TPUs due to distributed training challenges.
    - Discussed the future of long context learning and the possibility of a variety of architectures behind a task.
    - Raised questions about the dominance of attention mechanisms and the potential evolution of architectures in 2024.
    26:10 🔄 *Transformer's Continued Strength and Integration of New Ideas*
    - Affirmed Transformer's continued strength as a safe and efficient architecture.
    - Acknowledged the emergence of new ideas, such as state space models and linear attention, and their potential integration into Transformers.
    - Suggested that alternative architectures might become components within the Transformer model.
    28:12 💡 *Attention as a Fundamental Computational Primitive*
    - Emphasized attention as a crucial and efficient computational primitive.
    - Discussed the trade-offs related to the state dimension, sequence length, and the continual innovation in architectural design.
    - Predicted that architectural design would become more complex and interesting.
    30:33 🌐 *Future Focus Areas: Data, Miniaturization, and New Applications*
    - Highlighted the ongoing importance of data in influencing model performance.
    - Explored the significance of miniaturizing architecture design for efficient iteration.
    - Expressed excitement about new applications, particularly in scientific and engineering domains beyond language.
    32:51 🎨 *Beyond Language: Excitement for Images, Entertainment, and Videos*
    - Stressed the potential value and excitement in applications beyond language, such as images, entertainment, and videos.
    - Shared personal experiments with generating DALL·E images and converting text to video.
    - Envisioned a future with multimodal content distribution and the emergence of text-to-video APIs.
    34:55 🚀 *Innovative Applications: Text-to-Voice, Multimodal Content Distribution*
    - Anticipated the development of advanced applications like text-to-video APIs.
    - Explored possibilities of generating conference videos from slide decks using AI.
    - Speculated on the transformative impact of AI systems on work, entertainment, and content creation.
    Made with HARPA AI
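
    The 02:27 and 17:30 takeaways above contrast attention, whose score matrix grows quadratically with sequence length, with state space models, whose recurrent state has a fixed size. Below is a minimal sketch of that contrast in plain PyTorch, assuming toy shapes and a diagonal SSM; it is an illustration only, not the implementation discussed in the interview.

```python
import torch

def attention_scores(x):
    # x: (seq_len, d_model). Projections are omitted for brevity, so queries
    # and keys are just x. The score matrix is (seq_len, seq_len), so memory
    # and compute grow quadratically with sequence length.
    q, k = x, x
    scores = torch.softmax(q @ k.T / x.shape[-1] ** 0.5, dim=-1)
    return scores @ x  # (seq_len, d_model)

def ssm_scan(x, a, b, c):
    # x: (seq_len, d_model); a, b, c: (d_state,) parameters of a diagonal SSM.
    # The hidden state h has a fixed size, so each step costs O(d_state * d_model)
    # and the whole sequence is linear in seq_len.
    h = torch.zeros(a.shape[0], x.shape[-1])
    ys = []
    for t in range(x.shape[0]):
        h = a[:, None] * h + b[:, None] * x[t][None, :]  # recurrent state update
        ys.append((c[None, :] @ h).squeeze(0))           # readout to d_model
    return torch.stack(ys)  # (seq_len, d_model)

L, d, n = 512, 64, 16
x = torch.randn(L, d)
print(attention_scores(x).shape)  # torch.Size([512, 64])
print(ssm_scan(x, 0.9 * torch.rand(n), torch.randn(n), torch.randn(n)).shape)  # torch.Size([512, 64])
```

    Because the SSM state never grows with the sequence, per-step work stays constant, which is the property the hardware-aware kernels discussed at 21:32 exploit by keeping that state in fast on-chip memory.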
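
    The 11:43 takeaway describes Striped Hyena as a hybrid that composes different layer categories. The sketch below only illustrates that compositional idea, assuming a made-up GatedConvMixer stand-in for the convolution/SSM blocks and an attention layer every fourth block; it is not the actual Striped Hyena architecture, and the names and ratios are arbitrary.

```python
import torch
import torch.nn as nn

class GatedConvMixer(nn.Module):
    """Stand-in for a convolution/SSM-style sequence mixer (linear-time)."""
    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, seq, d_model)
        seq_len = x.shape[1]
        h = self.conv(x.transpose(1, 2))[..., :seq_len].transpose(1, 2)  # causal crop
        return h * torch.sigmoid(self.gate(x))   # simple gating

class HybridStack(nn.Module):
    """Alternate attention and convolution blocks, e.g. attention every 4th layer."""
    def __init__(self, d_model: int = 64, n_layers: int = 8, attn_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
            if (i + 1) % attn_every == 0 else GatedConvMixer(d_model)
            for i in range(n_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            if isinstance(layer, nn.MultiheadAttention):
                x = x + layer(x, x, x, need_weights=False)[0]  # residual attention block
            else:
                x = x + layer(x)                               # residual mixer block
        return x

print(HybridStack()(torch.randn(2, 32, 64)).shape)  # torch.Size([2, 32, 64])
```

    Keeping most blocks linear-time and reserving full attention for a few layers is what makes this kind of hybrid cheaper than a pure Transformer at long sequence lengths.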

  • @SimonHuggins 4 months ago

    This is fascinating - had no idea about active research areas in RNNs - Makes sense that these complementary approaches will over time come together according to the use case. Everyone’s going on about 2024 being all about multi-modal, but I wonder if it will be more about multi-model - which has got to help drive the efficiencies of multi-modal. Thanks for this - I didn’t get some of the nuances, but I bookmarked this because this seems like one of those interviews worth revisiting from time-to-time. Anyway, I’m guessing with conversations like this your channel will soon be getting really good traction.

  • @420_gunna 4 months ago

    Damn I am in early on this channel :)
    Just came across your blog, it's great! subbed

  • @IlEagle.1G 4 months ago

    Immediately subscribed. Good work man

  • @dr.mikeybee 4 months ago +2

    In the context of NLP, considering that attention calculates relative importance between tokens to create a context signature or vectorized representation while an SSM creates a lossy compression of large context via a log signature, has testing been done concatenating or adding the two? I wonder if an SSM could replace the sinusoidal positional encoding in Transformers.
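
    For reference, the sinusoidal positional encoding mentioned in this comment is the fixed encoding from the original Transformer paper; the snippet below simply reproduces it. Whether an SSM-derived signal could stand in for it is the commenter's open question and is not settled in the interview.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # Fixed encoding from "Attention Is All You Need", added to token embeddings.
    # Assumes d_model is even.
    position = torch.arange(seq_len, dtype=torch.float32)[:, None]          # (L, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                  # (d/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even feature dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd feature dimensions
    return pe

print(sinusoidal_positional_encoding(128, 64).shape)  # torch.Size([128, 64])
```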

  • @DreamsAPI 4 months ago +1

    Subscribed at the mention of bringing scientific discourse to AI