Transformers from Scratch - Part #2

  • Published 31 May 2024
  • ➡️ Learn more about Trelis Resources at Trelis.com/About
    TIMESTAMPS
    0:00 Welcome and Link to Colab Notebook
    3:20 Encoder versus Decoder Architectures
    8:34 What is the GPT-4o architecture?
    10:37 Recap of transformer for weather prediction
    15:30 Pre layer norm versus post layer norm
    19:42 RoPE vs Sinusoidal Positional Embeddings
    26:00 Dummy Data Generation
    26:40 Transformer Architecture Initialisation
    30:48 Forward pass test
    32:40 Training loop setup and test on dummy data
    39:10 Weather data import
    45:40 Training and Results Visualisation
    47:20 Can the model predict the weather?
    51:32 Is volatility in the loss graph a problem?
    53:50 How to improve the model further?
  • Science & Technology

COMMENTS • 3

  • @loicbaconnier9150 • 15 days ago

    Pre-norm versus post-norm only differs at the first attention layer, no? So if the embedding vectors and the added positional embeddings are normalised, it must be the same, no?

    • @loicbaconnier9150 • 15 days ago

      I correct myself, it's not the same: in one case we only normalise the shift (after the feed-forward layer), keeping the previous vector; in the other we normalise the summed vectors. Why do we not normalise the shift first and the summed vectors after?

    • @TrelisResearch • 14 days ago • +1

      Yeah, that was my original question too!
      But the difference is as per your follow-on comment: in pre-norm we normalise only the shift, as you point out.
      I suppose you could additionally normalise the sum, but I guess that is empirically not worth the added step (most of the benefit comes from normalising the shift).
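
      To make the distinction in this thread concrete, here is a minimal PyTorch sketch (not the notebook's actual code; the class names and the single-attention-sub-layer setup are my own simplification) contrasting post-norm, pre-norm, and the "normalise both" variant asked about above.

      ```python
      import torch
      import torch.nn as nn


      class PostNormBlock(nn.Module):
          """Post-norm (original Transformer): normalise the residual sum."""

          def __init__(self, d_model: int, n_heads: int):
              super().__init__()
              self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
              self.norm = nn.LayerNorm(d_model)

          def forward(self, x):
              shift, _ = self.attn(x, x, x)     # sub-layer output, i.e. the "shift"
              return self.norm(x + shift)       # the sum is normalised


      class PreNormBlock(nn.Module):
          """Pre-norm (GPT-style): normalisation sits on the shift branch only."""

          def __init__(self, d_model: int, n_heads: int):
              super().__init__()
              self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
              self.norm = nn.LayerNorm(d_model)

          def forward(self, x):
              xn = self.norm(x)                 # normalise before the sub-layer
              shift, _ = self.attn(xn, xn, xn)
              return x + shift                  # residual stream left unnormalised


      class PreAndPostNormBlock(nn.Module):
          """The variant asked about in the thread: normalise the shift branch
          and then also normalise the sum (hypothetical, for illustration only)."""

          def __init__(self, d_model: int, n_heads: int):
              super().__init__()
              self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
              self.norm_in = nn.LayerNorm(d_model)
              self.norm_out = nn.LayerNorm(d_model)

          def forward(self, x):
              xn = self.norm_in(x)
              shift, _ = self.attn(xn, xn, xn)
              return self.norm_out(x + shift)


      if __name__ == "__main__":
          x = torch.randn(2, 16, 64)            # (batch, seq_len, d_model)
          for Block in (PostNormBlock, PreNormBlock, PreAndPostNormBlock):
              print(Block.__name__, Block(64, 4)(x).shape)
      ```

      In the pre-norm block the residual stream is never renormalised, which is the "shift-only" behaviour described in the comment above; the third block adds the extra normalisation of the sum that the reply suggests is empirically not worth the added step.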