Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy

  • Published May 16, 2024
  • January 10, 2023
    Introduction to Transformers
    Andrej Karpathy: karpathy.ai/
    Since their introduction in 2017, transformers have revolutionized Natural Language Processing (NLP). Now, transformers are finding applications all over deep learning, be it computer vision (CV), reinforcement learning (RL), Generative Adversarial Networks (GANs), speech, or even biology. Among other things, transformers have enabled the creation of powerful language models like GPT-3 and were instrumental in DeepMind's recent AlphaFold2, which tackles protein folding.
    In this speaker series, we examine the details of how transformers work, and dive deep into the different kinds of transformers and how they're applied in different fields. We do this by inviting people at the forefront of transformers research across different domains for guest lectures.
    More about the course can be found here: web.stanford.edu/class/cs25/
    View the entire CS25 Transformers United playlist: • Stanford CS25 - Transf...
    0:00 Introduction
    0:47 Introducing the Course
    3:19 Basics of Transformers
    3:35 The Attention Timeline
    5:01 Prehistoric Era
    6:10 Where we were in 2021
    7:30 The Future
    10:15 Transformers - Andrej Karpathy
    10:39 Historical context
    1:00:30 Thank you - Go forth and transform

COMMENTS • 208

  • @sumitpawar000
    @sumitpawar000 11 months ago +370

    I was not aware that Megatron was attending this lecture to understand Transformers.
    He did ask some great questions 😄

  • @Kirby-Bernard
    @Kirby-Bernard 2 months ago +31

    I discovered that the best way to understand this lecture is to study, in parallel, Andrej's "Let's build GPT: from scratch, in code, spelled out" YouTube video. Browsing through that video gave me much better insight into this one. He codes the attention mechanism directly in PyTorch there, and it is fascinating how things just start clicking.

    • @manifestasisanubari
      @manifestasisanubari 2 months ago

      Thanks for the recommendation! ♥

    • @pictzone
      @pictzone 2 months ago +2

      how tf do some people just blatantly copy/paste another comment lol

    • @RalphDratman
      @RalphDratman 2 months ago +4

      Andrej's "Let's build GPT" video:
      ua-cam.com/video/kCc8FmEb1nY/v-deo.html

  • @user-ox9wq1gj5k
    @user-ox9wq1gj5k 1 year ago +41

    Thank you very much! If possible, please keep posting other lectures from the 2023 playlist, this is awesome! 👍

  • @jcorey333
    @jcorey333 11 months ago +43

    It was amazing to learn about the historical context of transformers! The audio was a bit low-quality, but I'm still glad this was posted

  • @jason_huang03
    @jason_huang03 1 year ago +8

    Really looking forward to the rest of the videos of 2023!

  • @briancase9527
    @briancase9527 9 months ago +2

    Wow, I missed this when it was contemporary; glad I found it now at least. Great video with great content! Thanks!

  • @footfunk510
    @footfunk510 11 months ago +1

    Thanks for the video. I look forward to watching the upcoming lectures.

  • @ahmedivy
    @ahmedivy 1 year ago +38

    Pure Gold Content by a LEGEND Teacher 💖

  • @1potdish271
    @1potdish271 1 year ago +8

    Great lecture by the legend Andrej Karpathy.

  • @dsazz801
    @dsazz801 5 months ago +1

    Thank you for sharing such a high-quality lecture!

  • @lukeliem9216
    @lukeliem9216 11 months ago +60

    I discovered that the best way to understand this lecture is to study, in parallel, Andrej's "Let's build GPT: from scratch, in code, spelled out" YouTube video. Browsing through that video gave me much better insight into this one. He codes the attention mechanism directly in PyTorch there, and it is fascinating how things just start clicking.😇😀😀

    • @sapnilpatel1645
      @sapnilpatel1645 7 months ago

      True.

    • @DaveJ6515
      @DaveJ6515 7 months ago +1

      "All the pieces clicking into place" is exactly how I was describing the feeling to my students not ten minutes ago. You are definitely right.

  • @shauryaseth8859
    @shauryaseth8859 11 months ago +33

    Andrej is so good that we had Bane sitting in the audience asking questions

  • @jerryyang7011
    @jerryyang7011 1 year ago +82

    What a legend Andrej is - the historical context puts quite a bit of "human touch" on Transformers and AI/ML as a whole.

    • @dr.mikeybee
      @dr.mikeybee 5 months ago

      I always listen when Andrej talks.

    • @RalphDratman
      @RalphDratman 2 months ago

      @@dr.mikeybee I love Andrej

  • @Athens1992
    @Athens1992 1 year ago +32

    What better Friday night than Karpathy explaining transformers, love it!!!
    Good night from Greece

    • @stanfordonline
      @stanfordonline  1 year ago +14

      Hi George, thanks for watching. We will be releasing more videos from this series soon - stay tuned!

    • @Athens1992
      @Athens1992 1 year ago +1

      @@stanfordonline Amazing, I love Karpathy's teaching and how easy he makes it seem

    • @rajatpatel5691
      @rajatpatel5691 1 year ago +1

      @@Athens1992 Totally agree 💯

    • @harunyigit897
      @harunyigit897 2 months ago

      Good night from Turkey too

  • @tzz27
    @tzz27 1 year ago +15

    Always enjoy AI lectures from Stanford.❤

  • @davidsewell4999
    @davidsewell4999 11 months ago +43

    Is it just my audio or is Satan always the one asking questions in the audience?

  • @affrokilla
    @affrokilla 11 months ago +3

    Great lecture, thanks for uploading!

  • @mikeiavelli
    @mikeiavelli 1 year ago +35

    Andrej starts at 10:16

  • @amywang8711
    @amywang8711 1 year ago +1

    Great, very interesting. Thanks for providing the videos.

  • @dr.mikeybee
    @dr.mikeybee 5 months ago +4

    The attention mechanism is a dual-embedding architecture. It looks at the probability of two words being next to each other -- at least it uses something like cosine similarity to compare the tokens in a sentence. That's really the basis. For sequence to sequence translation, we use the fact that language has a definite shape inside a semantic space. Once again, we use something like cosine similarity to find a context signature (vectorized representation) that is closest to the context signature of the sequence in the original language.
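
    For readers who want the comparison step above in code: a minimal sketch of scaled dot-product attention in PyTorch, where the dot product between query and key vectors plays the cosine-similarity-like role the comment describes. The shapes and tensors here are illustrative, not taken from the lecture.

        import torch
        import torch.nn.functional as F

        B, T, C = 1, 6, 16                         # batch, tokens, embedding dim
        q = torch.randn(B, T, C)                   # queries: what each token looks for
        k = torch.randn(B, T, C)                   # keys: what each token offers
        v = torch.randn(B, T, C)                   # values: what each token communicates
        scores = q @ k.transpose(-2, -1) / C**0.5  # unnormalized similarity of every pair
        weights = F.softmax(scores, dim=-1)        # similarities -> attention weights
        out = weights @ v                          # weighted average of the values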

  • @swagatochatterjee7104
    @swagatochatterjee7104 11 months ago +4

    I'm a simple man. I see Andrej. I tap the video

  • @AIautopilot
    @AIautopilot 11 months ago +6

    The funniest moment of the presentation is at 1:00:22 🤣 Great video; Andrej is so knowledgeable and down to earth

  • @Bharathkumar-gv4ft
    @Bharathkumar-gv4ft 1 year ago +13

    Thanks for making this beautiful piece of content available to the public!

    • @stanfordonline
      @stanfordonline  1 year ago +4

      Hi Bharath, awesome feedback! Thanks for watching.

    • @jimshtepa5423
      @jimshtepa5423 1 year ago +2

      @@stanfordonline Do you expect all the other lectures to be published on YT?

    • @stanfordonline
      @stanfordonline  1 year ago +8

      @@jimshtepa5423 Hi Jim! We have 3 more lectures that will be published in the coming days and our team is working on making the remaining lectures available.

    • @Bharathkumar-gv4ft
      @Bharathkumar-gv4ft 11 months ago

      @@stanfordonline That will be great! I am eagerly waiting for the "Neuroscience-Inspired Artificial Intelligence" seminar by Trenton Bricken and Will Dorrell (Mar 7)

  • @ac12484
    @ac12484 10 months ago +1

    Very good, thanks Andrej

  • @ericgonzales5057
    @ericgonzales5057 3 months ago +1

    Please make more videos like this! I need to learn more from Andrej about the code; it would help me with my project so much! I love how he explains it. And that guy's question was so dumb! Come on!

  • @everydaybob
    @everydaybob 10 months ago +11

    Guys, did Andrew Ng help you with the audio for this lecture? Using a "state of the art" mic (filtered through a pillow) is usually his trademark

  • @snowman2627
    @snowman2627 1 year ago +10

    Andrej is the best teacher! The node-graph analogy is quite intuitive.

    • @stanfordonline
      @stanfordonline  11 months ago +2

      Hi Hao, thanks for watching and for your comment!

  • @ehza
    @ehza 11 months ago +1

    Andrej is a godsend!

  • @user-co6pu8zv3v
    @user-co6pu8zv3v 1 year ago +1

    Great seminar!

  • @peteluo5367
    @peteluo5367 5 months ago

    Thanks for sharing. This is really useful for me.

  • @temiwale88
    @temiwale88 4 months ago

    I'm, thankfully, not lost. I'm hanging on to these bombs. Thanks Andrej!

  • @firasobeid70
    @firasobeid70 16 days ago

    "The Unreasonable Effectiveness of Recurrent Neural Networks" -- from that article I learned about Andrej! It helped me develop my first LM in 2020. Meticulous explanations!

  • @wildwind4725
    @wildwind4725 9 months ago +10

    The year is 2023, and we have AI models capable of writing a decent essay. At the same time, the audio quality in online presentations is sometimes worse than that of the Apollo missions.

  • @user-qt9hh2mp4v
    @user-qt9hh2mp4v 1 year ago +1

    Great tutorial 🎉

  • @sanawarhussain
    @sanawarhussain 11 months ago +3

    @1:11:40 The guest is asking about the attention mechanism's communication phase on data that doesn't have consistent edges, where connections change -- for example, different molecules that can have the same number and type of atoms but different bonds between them. This won't work with the vanilla transformer architecture, where each token attends to itself and all the other tokens, so it is like a fully connected graph.
    An alternative way to process this data would be to use GNNs with an attention mechanism that respects these edge connectivities across the data.
    Or, if one really wants to use a transformer for this task, one would need to incorporate this prior knowledge of graph connectivity into the transformer; one recent paper (by Microsoft, I think) that achieved this is "Graphormer". Cheers!
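
    To make the edge-respecting idea above concrete, here is a minimal sketch of attention masked by a graph's adjacency matrix, so tokens (atoms) only attend along bonds. The adjacency matrix is made up for illustration, and q = k = v = x for brevity; this shows the general masking trick, not the Graphormer method itself.

        import torch
        import torch.nn.functional as F

        T, C = 4, 16                                       # 4 atoms, 16-dim features
        x = torch.randn(T, C)
        adj = torch.tensor([[1, 1, 0, 0],                  # hypothetical bond structure:
                            [1, 1, 1, 0],                  # atom i may attend to atom j
                            [0, 1, 1, 1],                  # only where adj[i, j] == 1
                            [0, 0, 1, 1]], dtype=torch.bool)
        scores = (x @ x.T) / C**0.5                        # pairwise similarities
        scores = scores.masked_fill(~adj, float('-inf'))   # non-edges get zero weight
        out = F.softmax(scores, dim=-1) @ x                # messages flow along edges only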

  • @gdymind7021
    @gdymind7021 4 months ago

    Thank you for the great presentation!

  • @AI_ML_iQ
    @AI_ML_iQ 11 months ago +1

    The attention mechanism does not require different matrices for query and key, in both the self-attention and cross-attention mechanisms. See the paper by R. V. R. Pandya titled "Generalized Attention Mechanism and Relative Position for Transformer".
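
    On one reading of that claim, the idea can be sketched as follows: use a single projection matrix for both query and key. This is only an illustration of the comment, not the cited paper's exact formulation.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        T, C = 6, 16
        x = torch.randn(T, C)
        w_qk = nn.Linear(C, C, bias=False)         # one matrix shared by query and key
        w_v = nn.Linear(C, C, bias=False)
        q = w_qk(x)                                # same projection...
        k = w_qk(x)                                # ...used for both roles
        att = F.softmax(q @ k.T / C**0.5, dim=-1)  # attention still works as usual
        out = att @ w_v(x)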

  • @vimukthirandika872
    @vimukthirandika872 3 months ago +1

    awesome!

  • @liangcheng9856
    @liangcheng9856 9 months ago +3

    sound quality plz.

  • @yuktikaura
    @yuktikaura 7 months ago +1

    Great lecture

  • @alielouafiq2552
    @alielouafiq2552 1 year ago

    OMG! Just noticed this was released today!

  • @TheBontenbal
    @TheBontenbal 1 year ago +9

    Great lecture as always (except for the audio ;-)). Does anyone have a link to Andrej's code? Thank you.

  • @leizhang3329
    @leizhang3329 5 months ago +2

    This video is an introduction to transformers in the field of AI, covering their applications in natural language processing, computer vision, reinforcement learning, and more. The instructors discuss the building blocks of transformers, including attention mechanisms and the use of self-attention and multi-headed attention. They also touch on the flexibility and efficiency of transformers compared to RNNs.
    Highlights:
    This section is an introduction to the course on Transformers and the instructors.
    The course is about deep learning models that have revolutionized the field of AI.
    Transformers have been applied in various fields such as natural language processing, computer vision, reinforcement learning, biology, and robotics.
    The instructors have research interests in reinforcement learning, computer vision, NLP, and have publications in robotics and autonomous driving.
    The message passing scheme in Transformers involves nodes looking at each other, with the decoder only looking at the top nodes.
    In the cross attention with the decoder, features from the top of the encoder are consumed.
    Multi-headed attention is the application of the attention scheme multiple times in parallel.
    Self-attention refers to each node producing a key, query, and value from itself.
    The section explains the process of combining token embeddings and positional embeddings in a transformer model.
    Token embeddings and positional embeddings are added together.
    Optional dropout is applied to the set of words and their positions.
    The input is fed into blocks of transformers.
    The output of the transformer is linearly projected to obtain the probability distribution for the next word.
    The targets, offset by one in time, are used for cross-entropy loss calculation.
    The blocks in the transformer model have a communication phase and a compute phase.
    In the communication phase, nodes in the graph communicate with each other.
    The section discusses different types of transformer models and their training objectives.
    There are decoder-only models like GPT, encoder-only models like BERT, and encoder-decoder models like T5.
    Encoder-only models like BERT are trained with a different objective than autoregressive language modeling and are then fine-tuned for tasks such as sentiment classification.
    Transformers are trained using masking and denoising techniques.
    The connectivity in transformers usually does not change dynamically based on the data.
    Transformers are flexible and can easily incorporate additional information by chopping it up and feeding it into the model with self-attention mechanism.
    Whisper is a copy-paste transformer that works well on mel spectrograms.
    Transformers can be used in RL to model sequences of states, actions, and rewards.
    Transformers are also used in AlphaFold to model molecules computationally.
    Transformers can easily incorporate extra information into the context by chopping it up and using self-attention.
    Transformers are more efficient and optimizable than RNNs due to their shallow wide graph structure, which allows for parallel processing and easy gradient flow.
    RNNs are inefficient and not optimizable due to their long thin compute graph structure.
    Transformers have a shallow wide graph structure, which enables quick supervision to input transitions and easy gradient flow.
    Transformers can process every word in parallel, unlike RNNs which process words sequentially.
    The efficiency of transformers allows for larger network sizes, which is crucial in deep learning.
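
    A minimal sketch of the forward pass the summary describes (token + positional embeddings, optional dropout, a stack of blocks, a linear projection to next-word probabilities, and cross-entropy against targets offset by one), in the spirit of nanoGPT. The sizes are illustrative, and nn.TransformerEncoderLayer stands in for the hand-written communicate/compute blocks from the lecture.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class TinyGPT(nn.Module):
            def __init__(self, vocab_size=65, block_size=32, n_embd=64, n_layer=2, n_head=4):
                super().__init__()
                self.tok_emb = nn.Embedding(vocab_size, n_embd)   # token embeddings
                self.pos_emb = nn.Embedding(block_size, n_embd)   # positional embeddings
                self.drop = nn.Dropout(0.1)                       # optional dropout
                layer = nn.TransformerEncoderLayer(n_embd, n_head,
                                                   dim_feedforward=4 * n_embd,
                                                   batch_first=True)
                self.blocks = nn.TransformerEncoder(layer, n_layer)
                self.head = nn.Linear(n_embd, vocab_size)         # project to next-token logits

            def forward(self, idx, targets=None):
                B, T = idx.shape
                pos = torch.arange(T, device=idx.device)
                x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))   # add token + position
                causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                               device=idx.device), 1)  # hide future tokens
                x = self.blocks(x, mask=causal)                   # communicate + compute
                logits = self.head(x)                             # (B, T, vocab_size)
                loss = None
                if targets is not None:                           # targets = inputs shifted by one
                    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
                return logits, loss

        seq = torch.randint(0, 65, (4, 33))                       # toy batch of token ids
        logits, loss = TinyGPT()(seq[:, :-1], seq[:, 1:])         # offset-by-one targets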

  • @dr.mikeybee
    @dr.mikeybee 5 months ago

    We should also understand the linear operations on weighted representations in the projection matrices. These create a context signature that is easier to compare.

  • @beofonemind
    @beofonemind 11 months ago +1

    I'm putting this bad boy in my watch later..... with pen and paper and focus...... but the fact that this talk is available is amazing. Thanks Andrej. Thanks Stanford.

  • @nerouchih3529
    @nerouchih3529 3 days ago

    28:00
    A unique view of attention. In this image, all 6 nodes are related to all 6 nodes in the self-attention case, while in cross-attention it would be like set A sending messages to the nodes in set B. And voila, it's a fully-connected layer! But with tokens passed instead of scalar values
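
    The self- vs cross-attention distinction in that comment can be seen directly in PyTorch's built-in module: self-attention feeds one set as query, key, and value, while cross-attention queries from one set and reads keys/values from another. The two sets here are random stand-ins.

        import torch
        import torch.nn as nn

        mha = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
        A = torch.randn(1, 6, 16)    # set A, e.g. encoder tokens
        B = torch.randn(1, 6, 16)    # set B, e.g. decoder tokens
        self_out, _ = mha(A, A, A)   # self-attention: all 6 nodes of A talk to all 6 nodes of A
        cross_out, _ = mha(B, A, A)  # cross-attention: B sends queries, A provides keys/values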

  • @Anonymous-lw1zy
    @Anonymous-lw1zy 4 months ago

    FWIW, at 33:00, for the inputs tensor, plus the last character from the targets tensor (so the first quoted section is 47, 58, 1, 51, 59, 57, 58, 1, 40), I get:
    [["it must b"], [" Get him "], ["come: And"], ["u look'st"]]

  • @iansnow4698
    @iansnow4698 10 months ago +1

    Hi Andrej,
    It's a great historical view of attention that you showed there; the email especially is a golden discovery in my eyes. All I could find before went only as deep as Yoshua's papers.
    I have a question I hope you or someone else can answer here: is there any connection between the key/value/query mechanism in the later paper and the weighted-average-over-BiRNN idea in the email? Or was that simply a new idea in the Attention Is All You Need paper?
    Best regards,
    Ian

  • @soumilyade1057
    @soumilyade1057 11 months ago +5

    The quality of the audio has marred an otherwise great lecture 😬 See if it can be improved... thank you ❤

  • @jackgame8841
    @jackgame8841 11 months ago

    this legend

  • @sitrakaforler8696
    @sitrakaforler8696 1 year ago

    Damn, thanks man!

  • @lukeliem9216
    @lukeliem9216 11 months ago +21

    Feedback for the Stanford team: improve the microphone system for your webcasting. The questions posed by people in the classroom are muffled because of noise cancellation (turned on by default), and it really degraded the quality of this seminar. I look forward to a redo of this transformer seminar, since it is the foundation of generative AI. So, in a nutshell: a better microphone setup, and a better explanation of the transformer from Andrej. His 6-node graph complicated rather than clarified his explanation.

    • @SuperZardo
      @SuperZardo 5 months ago

      You believe they used mics, I think they just spoke into some kind of toilet bowl

  • @seppmeier9961
    @seppmeier9961 1 year ago

    Great dude

  • @exbibyte
    @exbibyte 11 months ago +1

    excellent

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago +256

    Audio could be better

    • @HemangJoshi
      @HemangJoshi 11 months ago +4

      Definitely

    • @HemangJoshi
      @HemangJoshi 11 months ago +29

      Even a $10 mic could give better results than this. They didn't even honor Karpathy enough to get a decent mic 🎤 Can't believe Stanford shot video like this

    • @frankyvincent366
      @frankyvincent366 11 months ago +19

      Yes, there are AI algorithms to improve sound by suppressing room noise... made using transformers 😅

    • @miyamotomasao3636
      @miyamotomasao3636 11 months ago

      And in English, too !

    • @recursion.
      @recursion. 11 months ago +4

      Dude I'm pretty sure they know about this. Be grateful that you're getting access to materials from one of the top schools in America.

  • @IAmScottCarlson
    @IAmScottCarlson 9 months ago +1

    It was really, really hard to listen to this one due to the audio quality; please resolve it for any future presentations.

  • @yuxingben399
    @yuxingben399 1 year ago +2

    Please release more videos from this series.

    • @stanfordonline
      @stanfordonline  1 year ago +4

      Stay tuned! More videos from this series will be published soon.

  • @bpmoran89
    @bpmoran89 1 month ago

    Describing RNNs and LSTMs as prehistoric is wild

  • @user-xr1cz2ht1b
    @user-xr1cz2ht1b 11 months ago +1

    Is it feasible to get access to the code samples that Andrej is talking about?

  • @harrylee27
    @harrylee27 9 months ago

    The audience member who asked questions sounds like a real Transformer, 46:35

  • @wizche
    @wizche 11 months ago +3

    Great content! I would just suggest investing in better microphones for a more pleasant listening experience

  • @TwoSetAI
    @TwoSetAI 7 months ago +1

    Cannot hear the questions at all.

  • @DaTruAndi
    @DaTruAndi 11 months ago +4

    Great content; the audio quality makes it a bit challenging to listen to, and the speakers could perhaps try to speak a bit slower and more clearly to make it more accessible to international audiences. Slowing down to 0.75 and turning on subtitles helps a bit. Transcribing with Whisper could additionally be an option.

    • @yuluqin6463
      @yuluqin6463 9 months ago

      exactly, 0.75 works better for me

  • @jaredthecoder
    @jaredthecoder 11 months ago +3

    Audio is a little clearer if you put it on .75

  • @christofferweber9432
    @christofferweber9432 11 months ago +2

    Sad that a great lecture is cut short by questions that could have been taken offline...

  • @saptarshipalchaudhuri5640
    @saptarshipalchaudhuri5640 9 months ago +1

    This really piqued my interest. The seminal papers on the road to developing transformers included here make the introduction just perfect. The audio placed hurdles, though. I watch lectures at 2X speed or more; here I could not go beyond 1.5

  • @amoghjain
    @amoghjain 4 months ago

    Hello!! Thank you for sharing the talk!! Is it possible to share the slides as well?? Thanks

  • @dimitargueorguiev9088
    @dimitargueorguiev9088 1 month ago

    I am skeptical about the common-sense and logical/causal reasoning capabilities of transformer-based architectures. The fact that, out of N different scenarios, one sees output that in M < N cases can be explained as adhering to logical/causal reasoning does not mean that transformer-based architectures induce logical/causal reasoning.

  • @23232323rdurian
    @23232323rdurian 11 months ago +3

    the AUDIO is real choppy.....hard to make out the words spoken...but great lecture

  • @raminanushiravani9524
    @raminanushiravani9524 8 months ago

    Anyone know where to find the slides?

  • @TheNewton
    @TheNewton 18 days ago

    19:47 So is there a functional difference in calling the use of softmax `attention`, rather than the simpler word `search`, beyond trying to be catchy?

  • @lukeliem9216
    @lukeliem9216 11 months ago +3

    I think Andrej is still in the process of percolating his understanding of transformers, so the lecture is not as cohesive as his CS231n lectures on CNNs. I look forward to his 2nd or 3rd try at this subject matter. His presentation at Microsoft BUILD is simpler to comprehend, though it is less technical and implementation-focused than this lecture.

  • @vamshi3676
    @vamshi3676 5 months ago

    He mentioned that multi-head attention is attention applied in parallel, but from another video I understood that one big attention layer is chopped into pieces so that they can be processed in parallel. Am I wrong, or did he miss that point? Please, someone clarify 🙏🙏
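
    Both descriptions amount to the same thing: the model dimension C is chopped into n_head chunks of size C/n_head, each chunk runs attention independently (in parallel), and the results are concatenated back. A minimal sketch, with q = k = v = x for brevity (real code applies learned projections first):

        import torch

        def multi_head_attention(x, n_head):
            B, T, C = x.shape                             # C is split across the heads
            hs = C // n_head                              # head size
            q = x.view(B, T, n_head, hs).transpose(1, 2)  # (B, n_head, T, hs)
            k, v = q, q                                   # illustration only: no projections
            att = (q @ k.transpose(-2, -1)) / hs**0.5     # each head attends independently
            att = att.softmax(dim=-1)
            out = att @ v                                 # (B, n_head, T, hs)
            return out.transpose(1, 2).reshape(B, T, C)   # concatenate the heads back

        y = multi_head_attention(torch.randn(2, 6, 16), n_head=4)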

  • @wolpumba4099
    @wolpumba4099 2 months ago

    *ELI5 Abstract*
    *Imagine transformers as super-smart LEGO blocks:*
    * *They learn by paying attention:* Transformers figure out what's
    important in a bunch of information, just like you focus on the
    right LEGO piece to build something cool.
    * *They talk to each other:* Transformers share info, like when you
    ask a friend to pass a LEGO brick.
    * *They can be built in many ways:* You can make different things with
    LEGOs, and transformers can learn to do different stuff too! They
    can understand words, make pictures, and even play games.
    * *They get better with practice:* The more you build with LEGOs, the
    better you get. Transformers get smarter the more they learn from
    examples, like getting better at building a castle after making a
    few towers first.
    * *They need a little help sometimes:* Sometimes you need instructions
    for a fancy LEGO build. Transformers can also use hints to learn
    faster, especially when they don't have lots of examples.
    * *They like to remember things:* Transformers have a scratchpad, just
    like you use a notebook to remember steps, so they don't forget
    important stuff.
    *Transformers are changing the world:* They're like the new building
    blocks for computers, making them understand us and do much cooler
    things!
    *Abstract*
    This video explores the remarkable transformer architecture, a
    foundational building block in modern AI. Transformers were introduced
    in the 2017 paper "Attention is All You Need" and have revolutionized
    fields like natural language processing (NLP), computer vision, and
    reinforcement learning.
    The video delves into several key aspects of transformers:
    * *Core Concepts:* Attention mechanisms, message passing on directed
    graphs, and the interplay between communication and computation
    phases within a transformer block.
    * *Implementation:* A detailed walkthrough of a minimal transformer
    implementation (NanoGPT) highlights data preparation, batching,
    positional encodings, and the essential components of transformer
    blocks.
    * *Transformers Across Domains:* The ease with which transformers
    adapt to diverse modalities (images, speech, reinforcement learning)
    underscores their flexibility.
    * *Meta-Learning Capabilities:* Transformers exhibit in-context
    learning or meta-learning capabilities, highlighted by the GPT-3
    model. This suggests potential for gradient-like learning within
    transformer activations.
    * *Optimizability and Efficiency:* Transformers are designed to be
    highly optimizable by gradient descent and computationally efficient
    on GPUs, key factors in their widespread adoption.
    * *Inductive Biases and Memory:* While inherently general,
    transformers can incorporate inductive biases and expand memory via
    techniques like scratchpads, demonstrating adaptability.
    The video also includes discussions on the historical context of
    transformers, their relationship to neural networks, and potential
    future directions in AI.
    *Keywords:* Transformers, Attention, Deep Learning, NLP, Computer
    Vision
    See also: ua-cam.com/video/kCc8FmEb1nY/v-deo.html

    • @wolpumba4099
      @wolpumba4099 2 months ago

      *Summary*
      *Introduction to Transformers*
      * *0:05** - Welcome and course overview:* Introduction to a course
      focused on transformers in artificial intelligence (AI).
      * *0:52** - Instructors introduce themselves:* The course instructors
      share their backgrounds.
      *Foundations of Transformers*
      * *3:24** - Introduction to transformers:* The basics of transformer
      architecture are explained.
      * *3:38** - Explanation of the attention timeline:* Discussion of how
      attention mechanisms developed over time.
      *Understanding and Implementing Transformers*
      * *3:51** - Transformer Evolution:* Progression from RNNs, LSTMs, and
      simple attention to the dominance of transformers in NLP, vision,
      biology, robotics, and generative models.
      * *10:18** - Andrej Karpathy presents on transformers* Karpathy provides
      historical context on why transformers are important and their
      evolution from pre-deep learning approaches.
      * *15:15** - Origins of the Transformer* Exploration of foundational
      papers on neural machine translation and the introduction of
      attention to solve the "encoder bottleneck" problem.
      * *20:13** - Attention is All You Need:* Discussion of the landmark 2017
      paper, its innovations, and core concepts behind the transformer
      (attention, positional encoding, residual networks, layer
      normalization, multi-headed attention).
      * *22:36** - The Speaker's view on Attention:* A unique perspective on
      attention as a communication phase intertwined with computation.
      * *25:13** - Attention as Message Passing:* Explanation of attention as
      nodes in a graph communicating with "key", "query", and "value"
      vectors. Python code illustrates the process.
      * *30:58** - NanoGPT: Transformer Implementation* Introduction of
      NanoGPT, a minimal transformer the speaker created to reproduce
      GPT-2, followed by in-depth explanations of its components, data
      preparation, batching, and block structure.
      *Transformers: Applications and Future Directions*
      * *52:56** - Transformers Across Domains:* How transformers are adapted
      for images, speech recognition, reinforcement learning, and even
      biology (AlphaFold).
      * *54:26** - Flexibility with Multiple Inputs:* The ease of
      incorporating diverse information into transformers.
      * *55:43** - What Makes Transformers Special?:* Highlighting in-context
      learning (meta-learning), potential for gradient-like learning
      within activations, and the speaker's insights shared via tweets.
      * *58:27** - The Essence of Transformers:* Three key properties:
      expressiveness, optimizability, and efficiency on GPUs.
      * *59:51** - Transformers as General Purpose Computers Over Text:*
      Analogy comparing powerful transformers to computers executing
      natural language programs.
      * *1:06:28** - Inductive Biases in Transformers:* The balance between
      data and manual knowledge encoding, and how to modify transformer
      encodings.
      * *1:08:42** - Expanding Transformer Memory:* The "scratchpad" concept
      for extending memory.
      *Questions and Answers*
      * *27:30** - Q&A: Self-Attention vs. Multi-headed Attention* Explaining
      the differences and purposes.
      * *46:12** - Q&A: Dynamic Connectivity in Transformers* Discussion on
      graph connectivity in transformers.
      * *50:20** - Q&A: Future Directions* Exploring beyond autoregressive
      models and the relation to graph neural networks.
      * *1:02:01** - Q&A: RNNs vs. Transformers* Contrasting the limitations
      of RNNs and the strengths of transformers.
      * *1:04:21** - Q&A: Multimodal Inputs* How transformers handle diverse
      data types.
      * *1:10:09** - Q&A: ChatGPT* The speaker's limited exploration of
      ChatGPT.
      * *1:10:41** - Q&A: S4 Architecture and Speaker's Next Steps* Focus on
      NanoGPT for GPT-like models and interest in building a "Google++"
      inspired by ChatGPT.
      Disclaimer: I used gemini advanced 1.0 (2024.03.03) to summarize the
      video transcript. This method may make mistakes in recognizing words
      and it can't distinguish between speakers.

  • @sansin-dev
    @sansin-dev 1 year ago +5

    It's a pity the audio is so bad

  • @nbme-answers
    @nbme-answers 11 months ago

    10:15 START

  • @laurentprat8219
    @laurentprat8219 10 months ago

    Hello, what is the 20 for in the node class? Is it the size of the embedding vector (only 20 tokens)? (Code shown at 25:30)

    • @faiqkhan7545
      @faiqkhan7545 7 months ago

      A 20-by-20 matrix, initialized randomly, that would be trained by backpropagation during training
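
      For context, a hypothetical reconstruction of the kind of toy Node class sketched around 25:30, assuming 20 is the embedding dimension; the names and shapes here are illustrative guesses, not Andrej's exact code.

          import numpy as np

          class Node:
              def __init__(self, n_embd=20):
                  self.data = np.random.randn(n_embd)          # this node's private vector
                  # random 20x20 projection matrices; in a real model these are learned
                  self.wkey = np.random.randn(n_embd, n_embd)
                  self.wquery = np.random.randn(n_embd, n_embd)
                  self.wvalue = np.random.randn(n_embd, n_embd)

              def key(self):                                   # what do I have?
                  return self.wkey @ self.data

              def query(self):                                 # what am I looking for?
                  return self.wquery @ self.data

              def value(self):                                 # what do I communicate?
                  return self.wvalue @ self.data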

  • @Alex-fh4my
    @Alex-fh4my 11 months ago

    I see Andrej I click.. easy as that..

  • @harriehausenman8623
    @harriehausenman8623 11 months ago

    Wow! Steven Feng looks *a lot* like Andrej Karpathy 😆

  • @paparaoveeragandham284
    @paparaoveeragandham284 5 months ago

    nice

  • @einemailadressenbesitzerei8816
    @einemailadressenbesitzerei8816 10 months ago

    5:18 what is "history encoding"?

  • @user-wq3vp7vc4x
    @user-wq3vp7vc4x 11 months ago

    Is the guy asking questions using a voice encoder, or does he have a voice that deep cuz he’s 12 feet tall?

  • @abdulnim
    @abdulnim 9 months ago

    Andrej ignored the Transformer in the first slide, but he kept asking questions.

  • @sahreenhaider9906
    @sahreenhaider9906 4 months ago +1

    What questions did Megatron ask?
    I mean the audio was pretty bad

  • @user-wq3vp7vc4x
    @user-wq3vp7vc4x 11 months ago

    What was the sentence he said before 'I have to be very careful'?

  • @user-xn8dp5zy8t
    @user-xn8dp5zy8t 11 months ago +2

    Really bad audio quality; please ensure the speakers have better microphones next time

  • @jbperez808
    @jbperez808 5 months ago

    @4:09 "performance increased every time we fired our linguists..." if you listen closely. The auto-transcript caught more of it than the human one.

  • @francescabazzi2515
    @francescabazzi2515 3 months ago

    All fantastic! 😊 Thanks a lot! 🙌 Shame about the terrible audio 🔊👎

  • @1ntrcnnctr608
    @1ntrcnnctr608 1 year ago +2

    When will auto "mastering"/EQ of audio be integrated here on YT?

    • @1ntrcnnctr608
      @1ntrcnnctr608 1 year ago

      @@hyperadapted yup, yearning for quality these days

    • @1ntrcnnctr608
      @1ntrcnnctr608 1 year ago

      @@hyperadapted "everyone will have a better learning experience" - 👑

  • @gabrielepi.3208
    @gabrielepi.3208 11 months ago

    Hey Stanford, a GPT is not needed to understand that you need some mics in the audience for better audio…

  • @gregx8245
    @gregx8245 3 months ago +1

    Div Garg's audio is so horrible, I'm moving on to other videos at the 1 minute 30 second mark. You guys have a lot to learn about video production. (Have you heard of microphones?)

  • @harshitkumar5147
    @harshitkumar5147 5 months ago

    Where do I get the slides?

  • @niclored
    @niclored 3 months ago +1

    If you don't work on the quality of the audio, everything you did for this presentation is kinda ruined. Please try a better mic, since this is the Stanford account and this is fairly recent. Audio should not be an issue, and in this video it is.

  • @aojing
    @aojing 1 month ago +1

    What is wrong with the questioners' voices? Was their audio deliberately post-processed by Stanford? 🙃

  • @ankurkumarsrivastava6958
    @ankurkumarsrivastava6958 2 months ago

    Can we get the slides?

  • @yurcchello
    @yurcchello 1 year ago

    Please re-upload with better sound quality

  • @0xggbrnr
    @0xggbrnr 11 months ago +1

    Use transformers to improve the audio quality next time.

  • @jonathanr4242
    @jonathanr4242 11 months ago +1

    You think 2011 was bad? I was doing NN image processing at the turn of the century

  • @harriehausenman8623
    @harriehausenman8623 11 months ago +1

    And one would think Stanford could afford microphones for their presentation, instead of the tin-cans they obviously use here.

  • @hussienalsafi1149
    @hussienalsafi1149 11 months ago

    😍😍😍😍😍😍😍🤠🤠🤠🤠