DeepSeek's GPU optimization tricks | Lex Fridman Podcast
- Published 10 Feb 2025
Lex Fridman Podcast full episode: ua-cam.com/video/_1f-o0nqpEI/v-deo.html
Thank you for listening ❤ Check out our sponsors: lexfridman.com/sponsors/cv8472-sa
See below for guest bio, links, and to give feedback, submit questions, contact Lex, etc.
*GUEST BIO:*
Dylan Patel is the founder of SemiAnalysis, a research & analysis company specializing in semiconductors, GPUs, CPUs, and AI hardware. Nathan Lambert is a research scientist at the Allen Institute for AI (Ai2) and the author of a blog on AI called Interconnects.
*CONTACT LEX:*
*Feedback* - give feedback to Lex: lexfridman.com/survey
*AMA* - submit questions, videos or call-in: lexfridman.com/ama
*Hiring* - join our team: lexfridman.com/hiring
*Other* - other ways to get in touch: lexfridman.com/contact
*EPISODE LINKS:*
Dylan's X: x.com/dylan522p
SemiAnalysis: semianalysis.com/
Nathan's X: x.com/natolambert
Nathan's Blog: www.interconnects.ai/
Nathan's Podcast: www.interconnects.ai/podcast
Nathan's Website: www.natolambert.com/
Nathan's YouTube: youtube.com/@natolambert
Nathan's Book: rlhfbook.com/
*SPONSORS:*
To support this podcast, check out our sponsors & get discounts:
*Invideo AI:* AI video generator.
Go to lexfridman.com/s/invideoai-cv8472-sa
*GitHub:* Developer platform and AI code editor.
Go to lexfridman.com/s/github-cv8472-sa
*Shopify:* Sell stuff online.
Go to lexfridman.com/s/shopify-cv8472-sa
*NetSuite:* Business management software.
Go to lexfridman.com/s/netsuite-cv8472-sa
*AG1:* All-in-one daily nutrition drinks.
Go to lexfridman.com/s/ag1-cv8472-sa
*PODCAST LINKS:*
- Podcast Website: lexfridman.com/podcast
- Apple Podcasts: apple.co/2lwqZIr
- Spotify: spoti.fi/2nEwCF8
- RSS: lexfridman.com/feed/podcast/
- Podcast Playlist: ua-cam.com/play/PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4.html
- Clips Channel: ua-cam.com/users/lexclips
*SOCIAL LINKS:*
- X: x.com/lexfridman
- Instagram: instagram.com/lexfridman
- TikTok: tiktok.com/@lexfridman
- LinkedIn: linkedin.com/in/lexfridman
- Facebook: facebook.com/lexfridman
- Patreon: patreon.com/lexfridman
- Telegram: t.me/lexfridman
- Reddit: reddit.com/r/lexfridman
DeepSeek's success is like an underdog F1 team winning races against giants, not because they had more money or better cars, but because they engineered their way to victory with extreme optimizations. What counts as extreme? Bypassing the default engine computer settings (Nvidia's CUDA library and the NCCL GPU comms library) and writing their own software to precisely control fuel injection, turbo boost, and power delivery for each track. Heck, web developers are always looking under the hood of popular libraries for optimizations. But that is cheap compared to optimizing for a training YOLO run.
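Loosely, "writing their own comms software" means scheduling communication to overlap with compute instead of taking the library defaults. A toy illustration, with Python threads standing in for GPU streams; nothing here is DeepSeek-specific, and the timings are made up:

```python
import threading, time

def compute_layer(i):
    """Stand-in for the backward pass of one layer."""
    time.sleep(0.01)
    print(f"computed grads for layer {i}")

def communicate_grads(i):
    """Stand-in for an all-reduce of that layer's gradients across GPUs."""
    time.sleep(0.01)
    print(f"synced grads for layer {i}")

# Default-style schedule: compute everything, then communicate everything.
# Hand-tuned schedule: start syncing layer i's gradients while earlier layers
# are still computing, so network time hides behind compute time.
pending = []
for layer in reversed(range(4)):            # backward pass runs last layer first
    compute_layer(layer)
    t = threading.Thread(target=communicate_grads, args=(layer,))
    t.start()                               # communication overlaps the next layer's compute
    pending.append(t)
for t in pending:
    t.join()
```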
I shouldn’t have eaten so much paste in grade school
Or sniff gasoline in my case
OpenAI, NVDA say that you should.
nodding and pretending I understand what they're saying...
Made me laugh
That’s how I got to be chief safety officer.
It will rub off ..
😂
Same for me, bro. Haha.
For R1, the main thing was GRPO. R1-Zero has very little supervision and just works. It's really a breakthrough, and it's been reproduced by several groups now.
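For anyone curious what GRPO actually does, here is a minimal sketch of its group-relative advantage, assuming only the high-level recipe from the public R1/DeepSeekMath papers; the function name and toy rewards are illustrative, not DeepSeek's code:

```python
import numpy as np

def grpo_advantages(rewards_per_prompt):
    """Group Relative Policy Optimization (GRPO) core idea: sample a group of
    responses per prompt, score them, and use the group's own mean/std as the
    baseline, so no learned value network is needed."""
    advantages = []
    for rewards in rewards_per_prompt:           # one group of sampled responses per prompt
        r = np.asarray(rewards, dtype=float)
        adv = (r - r.mean()) / (r.std() + 1e-8)  # normalize within the group
        advantages.append(adv)
    return advantages

# Example: two prompts, four sampled answers each, rewards from a verifier
print(grpo_advantages([[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]]))
```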
They have some very smart people at DeepSeek. If you read and understand their technical papers and articles - well, mostly understand them - you are blown away.
Seems like this one team at DeepSeek hits heavier than all the other AI startups 😂
1. They wrote the equivalent of machine code (assembly) for the GPU (PTX).
2. Used a collection of smaller expert networks (Mixture of Experts / MoE).
3. For MoE, a generalized routing algorithm decides which "experts" each input relies on and directs GPU compute there, with load balancing; manually hand-designing the algorithm is bad ("The Bitter Lesson," Rich Sutton, 2019), so use approaches that scale and avoid local maxima (a toy routing sketch follows this list).
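To make point 3 concrete, here is a toy sketch of top-k expert routing with an auxiliary load-balancing loss in the common Switch/GShard style; the shapes, variable names, and loss form are assumptions for illustration, not DeepSeek's actual router:

```python
import numpy as np

def topk_moe_route(x, gate_w, k=2):
    """Toy top-k MoE router: a gating layer scores every expert for each token
    and only the k best-scoring experts are actually run for that token."""
    logits = x @ gate_w                               # [tokens, experts] gating scores
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)             # softmax over experts
    topk = np.argsort(-probs, axis=-1)[:, :k]         # chosen expert indices per token

    # Auxiliary load-balancing loss: penalize routers that pile tokens onto few experts
    num_experts = gate_w.shape[1]
    load = np.bincount(topk.ravel(), minlength=num_experts) / topk.size
    importance = probs.mean(axis=0)
    aux_loss = num_experts * float(np.dot(load, importance))
    return topk, aux_loss

x = np.random.randn(8, 16)        # 8 tokens, hidden size 16
gate_w = np.random.randn(16, 4)   # 4 experts
print(topk_moe_route(x, gate_w))
```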
Excellent to have an impartial conversation about this! Open Source is an excellent path forward. Can’t wait for Digits 😮
If the 5090 launch was any indication, Digits will be unobtainable for the average customer.
To make an analogy: you need to serve 10 different dishes in your restaurant, but your kitchen has only two stoves, so you have to decide which ones to cook based on what customers ordered. The old way was to build ten stoves, each with its own specialty dish, but the Chinese couldn't afford 10 stoves, so they chose to use two and keep shuffling the dishes.
If I understand correctly, the guy is saying this kind of trick or over-engineering hasn't fared well historically in the deep learning field; brute force has been the best option.
You need a very skilled Chinese chef. 😂
To be closer to the so-called mixture-of-experts idea, you get 10 chefs who each have to learn one dish and rotate through those two stoves; the challenge is making sure each new chef gets a chance to practice his dish.
It's not about being unable to afford it, it's about not being able to get it.
It’s cooking two dishes in one wok at the same time by continuously flipping the wok.
While rice is in wok, noodles are in the air, then flip…
I fed the transcript to ChatGPT to explain this conversation to me in plain language. 😂
Should have fed it to Deepseek :)
@@jacobs8531 Beat me to it!!! XD
I'm actually impressed with DeepSeek, no lie. I've had the ChatGPT app downloaded for about a year and have been using it on and off... but the PR and recent news about DeepSeek has me playing around with it and trying to catch it out... I even asked it a vague, casual question I was thinking of, about a quote I only half-remembered from someone, not even using perfect grammar, and it got it right! Genius!
Same answer. It trains off sensei GPT.
I will never use chatgpt or openai, traitors. Deepseek for life or until it don’t work no more.
How do you extract the transcript from the video?
The low-level load balancing work is going to be extremely important for the Chinese devs if/when they move to their home-grown AI chips once they get their own non-ASML EUV lithography figured out. It would be really cool if some super-general version was possible that would allow one to mix GPUs like Nvidia and AMD together and the load balancing software would just tie it all together. Then there'd be no more vendor lock-in. Not sure if possible, but would be cool!
I think this already exists; it's called OpenCL. Check it out.
The work has already begun and is expected to be resolved within the coming year. The core issue China aims to address is opposing the United States' exclusive monopoly.
Nvidia's goal is to make CUDA as bloated as possible in order to jack up demand for their chips. It's not that hard to optimize the process with low level coding if you care to look.
Engineering work involves many layers of detailed work. Each element might be simple on its own, but with a large language model, having to dive down to the microcode level is really challenging, because the people who know the high-level libraries and think about training strategies usually don't know the low level. It also takes a long time for things to mature; to figure the whole thing out in only half a year, you need a good engineering team.
China already did that with their own supercomputers 10 years ago.
I've read in some Chinese media that DeepSeek R1 can now write the low-level code itself to adapt to GPUs from other brands such as AMD or Huawei. It's actually already running on Huawei's GPUs. The claim is that this sacrificed 5% of performance but cut reasoning cost by 70%.
Precisely. R1 is supposed to be good at coding, and there is no reason it can't help in coming up with PTX for the GPU.
None of this matters, it's old technology.
@@Gilberthasit you'll know why it matters you fool.
@@Gilberthasit what exactly?
So are we bullish on AMD?
Is it so difficult to accept that DeepSeek's developers have really built something innovative, rather than just branding them as "lucky"?
Also, their "luck" is based on "stolen" OpenAI models, as we know 😂 P.S. And of course the best Nvidia chips, which they bought illegally…
Some are still coping... but of course the majority with some brains know it was a great feat, given all the efforts to handicap their chip industry.
@@iliak3937 Seems like you have already made up your mind based on some stereotype. If you listen to the whole podcast, not just this 10-minute clip, you will find both Nathan and Dylan clarifying the dubious claim Sam Altman made that DeepSeek stole their OpenAI model, without providing any proof. DeepSeek's code base is truly open and public; you can download it to your computer and check it against the GPT API. Other than Mr. Sam's childish claim, there hasn't been any substantial evidence so far. The Financial Times mentioned that DeepSeek wrote in their paper that they used distillation (a data filtration process) to optimise input data. You can hear Nathan explain in the main podcast that it is standard practice in model training. So there is no reason to mock DeepSeek.
they weren't branding them as lucky were they now?
Guy on the left has a big freakin BRAIN
Thanks for sharing, guys! But I believe there are two things here:
1) AI processes today (both training and inference) are extremely brute-force. The inspiration is the brain, but the subparts of the model have very low specialization compared to our brain, and we are going through a problem similar to what "evolution" faced until we got to where we are today.
2) Developers (especially younger ones) got spoiled by cloud/virtual computing and don't pay as much attention to the quality of the code and how to get the best out of the hardware. What you just explained was standard procedure when people used to code on punch cards (and I never used them). Complexity abstraction in programming has allowed far more people to code, but it doesn't mean they code well, because many don't have a basic understanding of how code runs on the computer.
These two things combined brought us to the other side of the pendulum… Therefore I do believe there is still a lot of improvement to be had in architecture (not only the network architecture, but the overall system architecture).
Great guest choice
Best lex guests in ai so far. Need to watch the whole episode.
Well said overall. This "bitter lesson" rings familiar. Reminds me of the no early optimizations rule and other things that keep coming up. I do really feel this is true, people aren't truly innovating, they are hunting for the quick and easy wins. People have stopped looking to report 2x gains and settle for 10% because it is not nothing and everyone has settled into the assumption we are on the right track. My hot take is analog processing and compute in memory is the future and 10 years from now people will wonder why we stayed on transistors and massive gpus without stopping to investigate memristors more.
Simplicity wins because (most of the time) complicated things make training slow and scaling difficult. The only exception (and DeepSeek showed it) is when "complicated things" (i.e., custom PTX code and GPU kernels) save you memory/compute/communication during training, enabling you to scale up data/inference/...
The equivalent of compiler lectures from the 1970s. Examining sand grains....
Excellent conversation, smart people and Lex is so cool!
TL;DR: Lex listens to someone speak, then completely ignores it and brings up something he thinks is important, and the guests have to just nod and pretend he is smart...
Classic MAGA, it's all just about pretending
Yeah but can they change a flat tire
Probably not, they will just design a tire that never goes flat to solve that problem tho
Can you? - GPT 6.0 in Asimov mode
Can you?
It's a joke; obviously they're highly intelligent about compute. But yes, I can.
But can you create a tire that doesn’t go flat?
As the owner of the microwave gang subreddit, I am happy to help.
I have no idea what they're talking about I agree though.
Yes, tech can be very tricky stuff. Totally get it. Hey I just wonder if these guys ever tried turning it off and turning it on again? Just a thought
I find sometimes you can lightly tap it or give it a little shake and it works better.
@@josephposenecker9741 Yes, I think the tiny AI experts living inside the motherboards get stuck trying to crawl through the wires sometimes, and this can free them up
This channel has inspired me to learn coding thank you.❤
Please put Guest name in title!!!!
I think this would be a great lesson for the US. Instead of just having all the best brains using the best technology available, set aside a few of those brains and start a project where they are only allowed to use less-than-ideal technology. It could be a college project funded by the government, where students work on it with less-than-ideal technologies.
It won't work in the US.
Just found this channel. Damn, just what I need all the time.
So it does things faster. But does it do things better?
Unless the innovation comes from the US, it's just luck. Such a novelty. Isn't that how all research works: you try 1,000 things and 1 will work?
That's crazy to say.
American tech giants seething and so are the people who bought nvidia stocks
@toma9596 It's not. I have huge respect for Lex Fridman and I enjoy his podcasts. But the way he was trying to diminish their achievements, while the other guys were trying to admire what DeepSeek achieved, was quite visible.
I came across this and left to learn it on DeepSeek.
Right?
_"With peace and love, of course..."_
WOT?
Sips tea, sits back and folds arms.
I felt so proud that I could understand all of this.
Spikes are usually a hint people send.
Hmmm...does that mean that the Nvidia chip was not optimized by Nvidia itself?
Basically yes
Not so much the chip itself but the layers of access/SDK which are by definition generalised. And also - just like gfx drivers in the past - optimisation is a continuous and incremental process.
Nvidia optimised for a more generalised use case where there are many parameters they can't assume, so they must pick reasonable trade-offs. What DeepSeek could exploit is that theirs is not a generalised use case: they can pick great parameters because they know exactly the model architecture, exactly the training cluster dimensions, and where the limits are.
A Toyota Corolla is an amazing generalised solution to get the average driver where they want to go on average roads and unpredictable environments.
The Red Bull RB19 F1 car was an amazing solution for getting Max Verstappen to win a championship within tight design constraints, known tracks, predictable conditions etc.
Both are supremely hard to solve, but CUDA is like the Corolla and Deepseek needed a RB19 for what they wanted to do.
I listened to the whole thing. Understood maybe 15% of it.
Fascinating discussion! I truly appreciated the depth of knowledge, wealth of experience, and genuine enthusiasm the guests brought to the conversation.
The interviewer at the end sounded like a cowboy in a western movie
Now I know what my girlfriend feels like when I talk tech.
Nice videos especially the old-looking camera and the settings of the Drift Trike video is nostalgic...
in 2014 I was still a kid, playing and very excited...
but yeah it made me stop and reanalyze a little bit of things in my life... like, create one game then earn a lot of money, after that just enjoy the world, fkkk! ! 😆🤩
Richard Sutton's "Bitter Lesson" essay should be read by every AI researcher before they start their journey.
We as humans are trying to emulate the functionality of the brain. I wonder if a robot will look at itself and try to recreate its own traces and vias.
Nvidia or the CPU does this? If in fullscreen mode, if the compiler is not skipped. Many ifs. Most do not check all the ifs.
As someone trying to train a model for financial analysis, this interview was super useful….
Ai and training models really is the wild west
Try creating a console with a loophole for both ends. Loophole console.[]
Interesting. I remember reading articles about how developers would write very sloppy, hand-optimized code for video games in the '90s to get every ounce of performance out of the system. When I hear "sloppy code", I think of innovation and creativity, but in some cases it can just be someone being lazy. If you can get a 10%+ performance gain with "sloppy code", there might be a case for using it in the race we have entered...
I can't take the word "RIGHT" from this man anymore! Take the transcript, upload it to your LLM, remove the word "right", and replay it with text-to-speech. It's so much better... 💯
I didn't notice till you said this. Damn it.
DeepSeek hired gold and silver Olympiad medalists in math, physics, etc., to build their team. Good luck.
There comes a point where the model becomes better than every math genius in the world.
So, does DeepSeek really have lots and lots of H100s like Elon Musk claims? If they really have so many H100s, why would they do the extra hard work of programming in assembly language?
They don't. DeepSeek focuses on the "A" of AI.
Elon doesn't know what he's talking about.
Why would you want to walk if you have a car?
Ah yes, the multilayer perceptron, the feed forward network and the attention mechanism. All part of my daily lexicon 🤯
Microwave goes MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM BEEP.
i know some of these words
Right? Train a person to talk right? Am I right?
Did that guy say luck? 🤦 What an embarrassment. There is nothing lucky about what DeepSeek's team has achieved.
It's skill and hard work.
GPU assembly code
I think one day DeepSeek will improve to the extent that it can decide for itself.
11:50 I imagine those spikes in loss could be due to the model trying to "test out" new ideas. When humans learn language, there are times when the language starts to be used "creatively". Almost in the same sense as how new slang is born. Which in a broader sense could be a byproduct of learning to generalize and improvise/novelize.
Hmmm... I think you might be anthropomorphizing the math/algorithm behind these transformer models, but that might actually be a good thing. In the future, it's possible that most of the "low-level" work could be handled by AI models, which might be concerning in some cases, as we could lose control of the software we're running. We might end up focusing primarily on high-level thinking, relying on our intuition to explore new ideas (something current models aren't really good at). Imagine saying, "Hey GPT-8.0, I was wondering if the loss spikes during transformer model training might be due to..." and then the model replying, "Well, the idea of testing out new ideas might hold true for humans, but the gradient descent mechanism doesn't really incentivize the discovery of new high-level concepts and ideas. It's more likely that those loss spikes are related to, well, blablabla..."
@@hmind9836 That is assuming that "it knows" about a context it has never seen before.
Can someone explain it in star wars
That title tells you the level of coping this country has.
Good that I didn't do what I wanted last year, even at the low level, I'm flaked
None of this is especially technical or hard to understand. It's just programming at a lower layer than PyTorch and a much higher layer than machine code. If you do not understand these things, then why are you watching? Go study the fundamentals first.
Go to bed.
So they didn't have to worry about backward/cross compatibility, and optimized for their hardware.
If I hear the word “right” again, I’m gonna scream.
I think I understood it like 98.9% of this; is there anyone who can help me understand the rest of 1.1%, please? Thanxxxx.
I understood about .03% of this.
I could be wrong but it seems DeepSeek didn't really need the best programmers, just better programmers than Nvidia.
voice mode only ai is where things are great
You understand about 30% of what
Did you know the earth was actually always round, not just when Christopher Columbus sailed to America? Blew my mind.
I have a dream 😴
I don't know how many thousands of advanced Nvidia chips ChatGPT used, or how many billions of dollars were spent on training the algorithms. I asked ChatGPT to solve a very basic mathematical question:
A heavy smoker can make 1 new cigarette from 3 cigarette butts. He has 11 cigarette butts. So, how many new cigarettes can he make?
I was very disappointed at the wrong answer provided by ChatGPT 😢😮😮
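For what it's worth, here is a small script working the puzzle both ways; the usual intended answer (5) assumes he also re-rolls the butts of the cigarettes he just smoked:

```python
def cigarettes_no_reuse(butts, rate=3):
    """Only the original butts are rolled into new cigarettes."""
    return butts // rate

def cigarettes_with_reuse(butts, rate=3):
    """Butts from each newly smoked cigarette are collected and re-rolled."""
    smoked = 0
    while butts >= rate:
        new = butts // rate
        smoked += new
        butts = butts % rate + new   # leftovers plus the butts of what he just smoked
    return smoked

print(cigarettes_no_reuse(11))    # 3
print(cigarettes_with_reuse(11))  # 5, the classic intended answer
```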
Americans are extremely good at blah blah 😂😂😂😂
The Chinese optimized DeepSeek down to the hardware level... you need extreme technical ability to do that... it's like building a web page in assembly language 😮
Rather than worrying about the number of Nvidia GPUs required, the breakthrough is in the development of robots and self-driving cars, because R1 can make dedicated models so cheaply and efficiently.
There comes a point in everyone's life where your brain's ability to comprehend lags behind your interest in what is being said. This is one of those times…
Simple math
Only honest AI matters, all others are GIGO.
Another podcast explained that using PTX, the lower-level machine language, DeepSeek can use non-NVIDIA chips to train their AI model. This means they can use Huawei chips to train their LLM immediately. Huawei chips may be inferior in performance but abundant in supply. This podcast also discussed data center scale and power infrastructure for training LLMs. I think the playing field may be equal when all the factors are taken into consideration.
3 Ultra Nerds in 1 room production :P
❤❤❤❤❤❤
Mistral AI + DeepSeek =🧠❤💪
Can anyone tell me if I'm understanding this right: is the real significance of DeepSeek that it takes a big step toward synthetic data?
The big model costs a lot to run, but you can use it to output correct questions and answers and use those to train a smaller model, and that smaller model will become really good. That's very significant, because you can now run a very small model on your laptop, and it allows a lot more players to get in.
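That is essentially distillation at the data level. A toy sketch of the idea, with stub classes standing in for the real teacher and student models (the class names and methods are placeholders, not any particular API):

```python
class ToyTeacher:
    """Stand-in for an expensive, high-quality model call."""
    def answer(self, question: str) -> str:
        return f"A careful, correct answer to: {question}"

class ToyStudent:
    """Stand-in for a small model that gets fine-tuned on teacher outputs."""
    def __init__(self):
        self.memory = {}
    def train_step(self, question: str, target: str) -> None:
        # Placeholder for a gradient update on the (question, target) pair.
        self.memory[question] = target
    def answer(self, question: str) -> str:
        return self.memory.get(question, "I don't know yet.")

questions = ["What is a mixture-of-experts model?", "Why write PTX by hand?"]
teacher, student = ToyTeacher(), ToyStudent()

pairs = [(q, teacher.answer(q)) for q in questions]   # teacher generates synthetic data
for q, a in pairs:                                    # student fine-tunes on it
    student.train_step(q, a)

print(student.answer(questions[0]))
```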
DeepSeek is not one big innovation but multiple of them working together to make a bigger splash.
The natural selection pressure of a lack of short trees forced giraffes to evolve elongated necks.
Similarly, the selection pressure of US trade policy restricting high-end GPUs is pushing the Chinese to evolve better software to compensate for the lack of high-end GPUs.
The Americans are fighting the losing battle of trying to stop evolution.
It's easier to make someone do the work and steal it. China stole from US, US will steal from China
What did i just understand 😂😂😂
I'm gonna share this to appear pretentious
Trust me bros it works.... experts ... no f.... about ... yolo run ...
Wait wait, it still uses GPUs?!? No way. I thought they figured out how to run on 256kb of RAM.
Stop saying right.
Yeah, too bad DeepSeek really sucks if you actually take the time to use it. I was trying to have it do some simple things and it couldn't even do them.
Right
Don't these experts you interview ever use diagrams? Every engineer I have come across does.
Diagrams available in the research papers.
Look at his head, it looks like a quantum computer.
Unpopular data viewpoint- there is no bad data. Bad sources maybe. Bad data is like bad press. It still has value.
You haven't written a single piece of software in your life. Don't have such strong opinions on things you know nothing about
@@leonlysak4927 Don't you know ChatGPT has made everyone an expert on everything.
There's absolutely bad data wtf
The true sign of understanding is being able to explain complex concepts so that a 5th grader could understand them. Grade = F
If you understand all of this, raise your hand.🙋♂️
I wanna visit r/microwavegang now lmao
They are from different planet😅
So basically multi-model-sharding across the GPU fabric ..
Can't the American models now take the lessons learned by the Chinese and integrate them into their models with all the hardware at their disposal and leap forward?