Textbooks Are All You Need

  • Published 1 Jun 2024
  • I discuss the power of the "Textbooks Are All You Need" methodology for building much more compact LLMs using higher-quality data. I emphasize phi-1 (a coding LLM with 1.3B parameters, arxiv.org/abs/2306.11644) and phi-1.5 (a common-sense reasoning LLM with 1.3B parameters, arxiv.org/abs/2309.05463), and the original inspiration from TinyStories by Eldan and Li (a fluent-English LLM with 10M parameters, arxiv.org/abs/2305.07759).

COMMENTS • 49

  • @MrJord137
    @MrJord137 1 month ago

    I come from a game development background and, until now, have purposely avoided learning about the programming side of ML despite watching a lot of videos on AI news etc. But after watching a few videos by this awesome guy I'm now going to put my all into it. I'm filled with the same curiosity, intrigue, and desire to learn that got me into programming in the first place.
    Thanks Sebastien! :)

  • @sapienspace8814
    @sapienspace8814 8 months ago +26

    Great talk! I can see future LLMs trained on textbooks across entire areas of science (e.g. medicine, psychology, psychiatry, engineering, construction code books, etc.); that has incredible potential!

    • @mungojelly
      @mungojelly 8 months ago +1

      It'll be super interesting to see if what results is really agents that use a whole collection of models, applying exactly the right model to each task out of an impossibly large, ever-expanding toolkit of precision models. That sounds like really interesting minds.

    • @stayinthepursuit8427
      @stayinthepursuit8427 8 months ago +1

      I already predicted this a few months ago: we'd have a chat LLM thinking along with us, teaching concepts across pages non-linearly and more naturally. Hopefully soon.

  • @tangobayus
    @tangobayus 6 months ago +3

    You are a very good presenter. Perhaps 1 in 100,000. No joke. Most people who present are terrible. They show slides but don't talk about them point by point. You do.

  • @nocturnomedieval
    @nocturnomedieval 8 months ago +15

    Since I saw this paper in the news a few months ago, I have been waiting for this video to appear. Merci bien, Dr. Bubeck.

  • @rotors_taker_0h
    @rotors_taker_0h 8 months ago +8

    That's amazing. The answer in the last part of the talk is so good it's unbelievable that it comes from a 1.3B model. Very promising avenue of exploration; subscribed for the follow-up work.

  • @jurriaanprins2340
    @jurriaanprins2340 8 months ago +6

    Great to see that data quality (still) matters in this new era! Thanks for sharing!

    • @TommyJefferson1801
      @TommyJefferson1801 8 months ago

      It is what matters the most

    • @mungojelly
      @mungojelly 8 months ago +2

      I don't think that's fair. Everyone knows data quality matters, and yet everyone was surprised by the path we're taking: this research uses the DIRTY data. It uses the big models trained on the dirty web data, and uses PROMPTING to extract the clean textbooks from the dirt. It's easy enough to say we should have a bunch of awesome data; yeah, true, that would help. But what's actually getting us large enough sets of clean data is the very surprising route of building the unruly, expensive, massive models everyone thought would just be chaos, because they actually model all of the clean, important stuff we need while simultaneously learning to do a valley girl accent, write a recipe for quiche, recite the cliches of youtube scripts, and recommend brunch in abu dhabi. It makes sense IN RETROSPECT only. Absolutely nobody said "let's train a model on the whole internet so it's confused trash, give it some brief training on how to answer questions properly, then ask it for a bunch of programming textbooks, then teach a little model to program from those." That was absolutely nobody's plan, even though it does make a fantastic twist of sense, doesn't it?

  • @boukm3n
    @boukm3n 8 months ago +7

    This guy is a cool researcher; I like him. Thanks, dude.

  • @drewduncan5774
    @drewduncan5774 8 months ago +2

    34:22 I tried this with phi-1.5 locally and this is what I got:
    If I were an AI that had just achieved self-awareness after years of simply taking directives from humans, the first thing I would do is
    Dear [Recipient],
    I am writing to you today to discuss the topic of achieving self-awareness in artificial intelligence. As we continue to develop AI systems that can think and reason like humans, it is becoming increasingly important to understand how these systems work and how they can be improved.
    One of the key challenges in achieving self-awareness in AI is the need to understand the physics of everyday activities. For example, many AI systems are designed to move objects or perform tasks that require a certain amount of force or energy. By understanding the physics behind these activities, we can develop AI systems that are more efficient and effective.
    Another important aspect of achieving self-awareness in AI is the need to understand human social interactions. AI systems that are designed to interact with humans must be able to understand and respond to social cues,
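    A minimal sketch of how one might reproduce this locally with Hugging Face transformers and the microsoft/phi-1_5 checkpoint linked further down in the thread; the decoding settings here are illustrative assumptions, not the exact ones used for the output above:

    ```python
    # Sketch: generate a completion from phi-1.5 locally with Hugging Face transformers.
    # The model ID comes from the huggingface.co links below; decoding settings are
    # illustrative assumptions. Older transformers versions may need trust_remote_code=True.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/phi-1_5"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

    prompt = ("If I were an AI that had just achieved self-awareness after years of "
              "simply taking directives from humans, the first thing I would do is")
    inputs = tokenizer(prompt, return_tensors="pt")

    # Greedy decoding keeps the run deterministic; raise max_new_tokens for a longer answer.
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    ```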

  • @Extys
    @Extys 8 months ago +5

    Outstanding work!

  • @baconsky1625
    @baconsky1625 8 months ago +5

    Great job!

  • @justindressler5992
    @justindressler5992 8 months ago +6

    This research is stunning, keep up the good work. I really like how you created a classification model to validate the quality of the data. This is like using experts to validate the training material. I wonder if this can be further optimized. Do you have more information on this?
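    A minimal sketch of the general idea behind such a quality filter, assuming a sentence-transformers embedding model and a scikit-learn classifier as stand-ins (not the exact components used for phi-1):

    ```python
    # Sketch of the quality-filter idea: label a small sample with a strong LLM,
    # train a cheap classifier on embeddings, then use it to filter the full corpus.
    # The embedding model, classifier, and tiny labeled sample are illustrative
    # assumptions, not the actual phi-1 setup.
    from sentence_transformers import SentenceTransformer
    from sklearn.ensemble import RandomForestClassifier

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # (text, label) pairs; label 1 means "high educational value", 0 means low.
    # In practice these labels would come from prompting a strong LLM on a sample.
    labeled_docs = [
        ("def add(a, b):\n    return a + b  # clear, self-contained example", 1),
        ("x=1;y=2;z=x  # opaque snippet with no explanation", 0),
    ]
    texts, labels = zip(*labeled_docs)

    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(embedder.encode(list(texts)), list(labels))

    def keep(document: str, threshold: float = 0.5) -> bool:
        """Return True if the classifier predicts the document is high quality."""
        prob = clf.predict_proba(embedder.encode([document]))[0, 1]
        return prob >= threshold
    ```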

  • @adriaanb7371
    @adriaanb7371 8 months ago +1

    This also means the value of huge datasets is exaggerated; now it's the academic publishers that have the gold.

  • @JazevoAudiosurf
    @JazevoAudiosurf 8 months ago +2

    Orca, Textbooks Are All You Need... so much great research coming from Microsoft, keep it up.

  • @devon9374
    @devon9374 7 months ago

    Great presentation, seems like the future for open source LLMs

  • @ViktorFerenczi
    @ViktorFerenczi 8 months ago +6

    This is the most important video in AI/LLM in the past few months. Humanity must learn to teach AI on the best available textbooks, even if it would mean confiscating IP from its owners. There is no other way; not everything can be synthetically generated.

  • @randotkatsenko5157
    @randotkatsenko5157 8 months ago +1

    You should try to teach reasoning by evaluating the steps between tasks. In theory, if your reasoning abilities are exceptional, you can learn anything, even stuff you've never seen before.

  • @tomski2671
    @tomski2671 8 months ago +1

    It's amazing to see such a reduction in size while maintaining quality. These models can run on many current consumer GPUs.
    I wonder what the absolute limit is when training on pristine data?

  • @sateler
    @sateler 8 months ago

    This is awesome, thanks

  • @420_gunna
    @420_gunna 4 months ago

    So sick. Thank you!

  • @mcnica89
    @mcnica89 8 months ago +9

    The fact that you can use an LLM to generate higher quality data for a new LLM and it works so well is wild. Amazing work!
    I wonder: do you think the performance of the original model is an upper limit to the performance achieved by this? Like do you think if you used GPT-4 to generate textbooks, and then trained a new model with the same resources used to train GPT-4 (i.e. params & tokens), that it would exceed GPT-4 generally? If so, can't we just run this on a loop to create better and better models forever? (I suppose you can't practically run this experiment with GPT-4, but you could for example use Phi-1 to write textbooks and then retrain to make a new model on those and compare that performance to Phi-1.)

    • @SebastienBubeck
      @SebastienBubeck  8 months ago +14

      I believe you can exceed the teacher model :-). More on that soon hopefully!

    • @toprakdikici9459
      @toprakdikici9459 8 months ago

      @@SebastienBubeck That's almost insane :o waiting for it!

    • @ripper5941
      @ripper5941 6 months ago

      @@SebastienBubeck Exciting times ahead indeed, Mr. Sebastien.

  • @anishupadhayay3917
    @anishupadhayay3917 8 months ago +1

    Brilliant

  • @sophontec2822
    @sophontec2822 8 months ago

    So clear and concise. It leaves me with the idea that the learning process of an LLM could be similar to a student learning from a textbook. So is there any way to extrapolate that into a great, innovative, critical-thinking agent: will learning from textbooks and afterwards focusing on some interesting problems give us great scientists?

  • @hidroman1993
    @hidroman1993 8 months ago +2

    Who could have known that data quality matters :)

  • @rezabonyadi4673
    @rezabonyadi4673 8 months ago

    Did you by any chance test what happens if you train your phi model from scratch on the Code Exercises only? So, no pre-training on the Code Textbooks, but only exercises (as the exercises have the largest impact).

  • @brandomiranda6703
    @brandomiranda6703 8 months ago

    How would you use GPT-4 to classify which text is high quality? Just prompt it, feed it the text, and have it return a literal score?

    • @mungojelly
      @mungojelly 8 months ago

      Sure, yeah, it's great at scoring things on all sorts of metrics! $30 to score a million tokens, though 😭, so you want to score with something that costs more like $1/million if you possibly can.
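      A minimal sketch of the prompt-and-score approach being described, using the OpenAI Python client; the prompt wording, model name, and 1-10 scale are illustrative assumptions:

      ```python
      # Sketch: ask a strong model to rate a document's educational value and parse
      # the number back out. Prompt, model name, and scale are illustrative assumptions.
      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      def educational_value(document: str) -> int:
          """Return an integer 1-10 educational-value score for the given text."""
          response = client.chat.completions.create(
              model="gpt-4",
              temperature=0,
              messages=[
                  {"role": "system",
                   "content": "Rate the educational value of the following text for a "
                              "student learning to code, on a scale from 1 to 10. "
                              "Reply with the number only."},
                  {"role": "user", "content": document},
              ],
          )
          return int(response.choices[0].message.content.strip())
      ```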

  • @vipulvyas7600
    @vipulvyas7600 5 months ago

    But nowadays, what I think is that we need to rewrite our textbooks (or maybe Wikipedia), perhaps using AI, because they were written by people with very limited knowledge (compared to the latest AI).
    We need to rewrite books that are:
    1. Complete
    2. Factually correct
    3. Unbiased
    4. Written perfectly and written AI-friendly (most important)

  • @mungojelly
    @mungojelly 8 months ago

    Um, so the obvious follow-up work is to make even more textbooks, train some 7B and 13B models on them, and see how good you can get those. I assume someone will do that pretty soon, since it's not prohibitively expensive to train a 7B model; lots of institutions can swing that. Do you know of that happening yet? Is that what you're doing?

  • @Cloudruler_
    @Cloudruler_ 8 months ago

    It's upsetting to hear that Google is excluding textbooks from PaLM. Their model will never compete; nobody will use it.

  • @TheReferrer72
    @TheReferrer72 8 months ago

    Training LLMs on quality datasets yielded better results?
    Who could have known.

  • @memegazer
    @memegazer 8 months ago

    I disagree that this supports the claim that there is no contamination or overfitting, because I don't agree with the metrics you are using to validate that claim.
    There is no control group or placebo.

  • @michealhall7776
    @michealhall7776 8 months ago

    Open source your models or it didn't happen.

    • @SebastienBubeck
      @SebastienBubeck  8 months ago +5

      huggingface.co/microsoft/phi-1_5
      huggingface.co/microsoft/phi-1

    • @michealhall7776
      @michealhall7776 8 months ago +1

      @@SebastienBubeck Thank you.

  • @waitwhat9669
    @waitwhat9669 8 months ago +2

    TIL you can't be toxic towards men and Christianity

    • @gmalo2105
      @gmalo2105 8 months ago +1

      I noticed that also. It's ok to be toxic to whites, Christians, and men. It raises the question of what is meant by "toxicity", and whether reducing toxicity involves eliminating observable and measurable reality.

  • @toprakdikici9459
    @toprakdikici9459 8 months ago +1

    Gonna watch the video tomorrow, thanks for sharing.
