I was just thinking about VHDL, and you did it. Thanks
I've been using DeepSeek-Coder for building a framework in the LLM field and it has been very impressive so far. I can't imagine what V2 will bring.
It doesn't need to cost $4M for the chip. You could implement it in an FPGA, on a project chip like Tiny Tapeout, or even in a metallized gate array, depending on the number of units you need.
I like jailbreaking models and wouldn't mind suffering through Rumble's jank to watch how others do it. Especially since any time I talk about it on YouTube my comments disappear.
Interesting
Great video. But why aren't you running the generated code and showing whether it works on the first try?
Can't wait for state-sponsored LLMs to start introducing code vulnerabilities when the user's location is in a rival country 🤣🤣
Yes, I really enjoy a no-brain knee-jerk reaction on hearing the word China 😂
Knowing what MooCode is makes me a dinosaur now? They even did it at Stanford!
That snake one is obviously wrong.
Generally speaking, these models use the same architecture and training methods published by OpenAI. That's why their outputs and capabilities are quite similar.
I'm a new PHP developer. Which LLM should I use? DeepSeek, or is there another one you recommend?
What could I include in a dedicated coding LLM video that would help you make this decision?
Tried it for coding and it gives the explanation in Chinese 😂😂😂😂. It also outputs gibberish sometimes, but that could be because of my limited hardware, an RTX 3070.
Okay, that sounds like more of a headache than I can stomach right now. Insisting on 'Explanations in English only' in the system prompt is a thing, but to be honest I'll only give it a try sometime whenever, because... Sonnet 3.5 reasons.
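For anyone who wants to try that system-prompt trick, a minimal sketch of what it could look like, assuming you're serving the model locally through Ollama; the model tag "deepseek-coder-v2" is an assumption, so substitute whatever you actually pulled:

```python
# Minimal sketch: pin the explanation language via the system prompt.
# Assumes a local Ollama server; "deepseek-coder-v2" is an assumed tag.
import ollama

response = ollama.chat(
    model="deepseek-coder-v2",
    messages=[
        {"role": "system",
         "content": "You are a coding assistant. Write all explanations "
                    "and code comments in English only."},
        {"role": "user", "content": "Explain Python generators with an example."},
    ],
)
print(response["message"]["content"])
```

No guarantees this fully stops language drift on heavily quantized builds, but it's a cheap first thing to try.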
Did the generated snake game run?
Yes, when executed on my local machine.
Codestral 22B and Llama 3 70B are tiny compared to GPT-4 Turbo; it would be pathetic if they were better than it.
This is a great point - a quantized build of the 16B DeepSeek Coder V2 actually fits entirely within 24GB of VRAM. I managed to fit it all on a single 4090.
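The arithmetic checks out as a back-of-envelope estimate. A rough sketch, counting weights only and ignoring KV cache and runtime overhead, which add a few more GB:

```python
# Back-of-envelope VRAM needed for the weights of a 16B-parameter model
# at different quantization levels. Ignores KV cache / activation overhead.
PARAMS = 16e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{PARAMS * bits / 8 / 1e9:.0f} GB")

# 16-bit: ~32 GB  -> does NOT fit a 24 GB 4090
# 8-bit:  ~16 GB  -> fits
# 4-bit:  ~8 GB   -> fits with headroom for long contexts
```

So the full-precision 16B doesn't fit; it has to be an 8-bit or lower quantization, which matches the claim.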
x.com/teortaxesTex/status/1802875700703068256
Which size model did you run and what is its context length?
On the online chat you only get the big models. And both have 128K context.
The wise EEs are all over this stuff. Especially those of us who design both HW and SW.
Much more impressive than Codestral. Any specific reason you use this for your work over other models? Great video! 😊
I agree! I chose DeepSeek-Coder initially because of its speed and relative ability on first-shot prompts. GPT-4 generally misses unless you're asking for something it's already known to be good at, like drafting complex bash scripts or FFmpeg filters. What do you use?
@@aifluxchannel I generally just use Mixtral 8x7B, and GPT-4 for everything else, but mostly for repetitive tasks and embedded C stuff. That's mostly using it to find things in docs.
@@GerryPrompt unless you have internet issue, why would you use open source models. As claude ai and gpt4 free is still the best for coding
@@tomgirl366 check the price of GPT-4o or GPT-4 against DeepSeek Coder V2.
It costs $0.14 and $0.28 per 1M tokens (input/output)... vs $5 and $15 for GPT-4o... So I use it for code-specific tasks in agent frameworks, because its knowledge and code generation are very good. But I use GPT-4o as the supervisor, because it's much better in that role.
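At those prices the gap compounds quickly in agent loops. A quick illustrative calculation, using the per-million-token prices quoted above and hypothetical token volumes:

```python
# Cost comparison at the quoted per-million-token prices (USD, input/output).
PRICES = {
    "DeepSeek Coder V2": (0.14, 0.28),
    "GPT-4o": (5.00, 15.00),
}

in_tok, out_tok = 2_000_000, 500_000  # hypothetical monthly agent traffic

for model, (p_in, p_out) in PRICES.items():
    cost = in_tok / 1e6 * p_in + out_tok / 1e6 * p_out
    print(f"{model}: ${cost:.2f}/month")

# DeepSeek Coder V2: $0.42/month
# GPT-4o: $17.50/month
```

Roughly a 40x difference, which is why routing the bulk of code generation to the cheap model and reserving GPT-4o for supervision makes sense.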
It has great, up-to-date knowledge. BUT... it's not great at following instructions and calling functions. It quite often responds in the wrong format, which makes it unusable in production as an agent supervisor or for code interpreting, like with Open Interpreter.
Can't wait to see a fine-tuned version with better instruction following that responds in the specific formats agent frameworks need :)
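Until such a fine-tune exists, one common workaround is to validate the reply and re-prompt on failure. A minimal sketch; `call_model` is a hypothetical stand-in for whatever client you use:

```python
# Sketch: enforce a JSON reply format with validate-and-retry.
# `call_model(messages) -> str` is hypothetical; plug in your own client.
import json

def get_structured_reply(call_model, messages,
                         required_keys=("tool", "arguments"), retries=2):
    for _ in range(retries + 1):
        raw = call_model(messages)
        try:
            reply = json.loads(raw)
            if all(k in reply for k in required_keys):
                return reply
        except json.JSONDecodeError:
            pass
        # Feed the bad output back so the model can correct itself.
        messages = messages + [
            {"role": "assistant", "content": raw},
            {"role": "user",
             "content": "Invalid response. Reply with JSON only, containing "
                        "the keys: " + ", ".join(required_keys)},
        ]
    raise ValueError("Model never produced valid JSON")
```

Crude, but it turns occasional format failures into a retry cost instead of a production outage.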
Do you think this model is recommended for VHDL and SystemVerilog coding?
It's definitely smarter than the previous models, but still fails my simple tests. Although, it did technically get one of the Verilog questions correct if you are purely thinking about inference code (i.e. relying on the compiler to infer a more complex mapping) - perhaps that's due to most of the training set being for FPGAs?
On that same problem, I was able to get it closer to the correct solution when applying the Socratic method, and it came up with a configuration I hadn't considered (which would be more efficient on FPGAs but not ASICs), although it failed to execute on the idea.
For your VHDL test, LLaMA2-7B would get the syntax correct most of the time, so that's not really an effective test. You would have to see how it performs implementing the logic itself, which would probably fail - I noticed it was unable to reason about the parallel nature, thinking that the code executed sequentially.
Regardless, the improvement of LLaMA3-70B could make it feasible for local deployment for data privacy reasons when using 4-bit (or 2-bit if it doesn't exhibit a huge loss of quality). Otherwise, I don't see much of a benefit over GPT4.
I'm pretty sure Verilog is a pretty niche application for typical LLMs. I'm not sure if this is just a test for you or an actual use case, but had you considered "black box fine-tuning", as in taking the outputs of common models (i.e. DeepSeek, Qwen, Llama 3) and then fine-tuning a dedicated small language model to adapt those answers into correct responses for your use case?
@@novantha1 I agree, although DeepSeek was trained on it (so were LLaMA3 and GPT-3/4) - this is obvious from the fact that it gets the syntax right most of the time (one of them would often confuse complex process statements for some Lisp-style language).
None of the models are good enough to be used in the practical application space of HDL (let alone debugging), but they can help with some boilerplate and sketching. The ideal application would be to use them in some agentic way for chip design (with black boxes), but that's not something any are currently capable of. So I was only testing simpler problems (which you would expect most first-year DLD students to get correct on a homework assignment).
Regarding a small language model to adapt the black boxes, that's very unlikely to work given the reasoning failure of the larger models. The issue being that HDLs are not sequential languages, which could confuse the causal assumption made by language models (HDL code is locally causal but globally more like a graph). That said, perhaps Verilog generation and debugging should be the benchmark that replaces ARC for reasoning (once ARC is solved).
Initially I was impressed, but today I asked a very simple question about T-SQL and it replied with a hallucination about a subject I'm working on but had not mentioned. Spooky... It also writes answers in Japanese or Chinese (sorry, I don't read either, so I may be wrong). Even more worrying, it did the same after a reboot...
Wow, you must have just filmed this... Turkey vs. Georgia just finished 2 hours ago!
I do my best to create videos as soon as I validate a topic! Let me know what you want to see next!
Excellent! I'm keen to see the Lite model in Aider.
Coming soon!
I tried this on the DeepSeek website and it sucked; it seemed like it had a low context length.
How does it fare vs Claude Sonnet?
Can you run the code in future videos or generate tests?
I think I might start creating secondary videos with more testing, these would go more in-depth and also include demos on actual local GPUs. Does that sound interesting / like something you'd like to watch?
@@aifluxchannel Well, it is pointless generating code if you don't test that it works. Other YouTubers give an LLM one chance to fix an error if one occurs. The process doesn't take that long.
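For reference, that "one chance to fix" loop is only a few lines. A rough sketch for Python-only snippets; `generate_code` is a hypothetical wrapper around whichever model is being tested:

```python
# Sketch: run generated Python, give the model one shot at fixing errors.
import subprocess
import sys
import tempfile

def run_snippet(code: str) -> subprocess.CompletedProcess:
    # Write the generated code to a temp file and execute it in a subprocess.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=30)

def generate_and_test(prompt: str, generate_code):
    code = generate_code(prompt)
    result = run_snippet(code)
    if result.returncode != 0:  # exactly one repair attempt
        code = generate_code(f"{prompt}\n\nYour code failed with:\n"
                             f"{result.stderr}\nReturn a fixed version.")
        result = run_snippet(code)
    return code, result.returncode == 0
```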
Is this better than Qwen2 for coding?
I prefer this model to Qwen 2, specifically the 16B version, for its speed.
Is the free endpoint running the 236B or the 16B?
Not actually sure, but I think it's the 16B variant. Can confirm this later tonight.
@@aifluxchannel alright, thank you
But you tested Coder V2 instead?!?!
What did I miss?
Are there VS Code plugins to integrate this for local dev? Or how do you use it to code?
Code your own simple wrapper in an afternoon, using AI to assist you 😅
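For the curious, the "afternoon wrapper" really is short, since DeepSeek exposes an OpenAI-compatible API. A minimal sketch; the base URL and model name are assumptions to verify against DeepSeek's current docs, and the key is a placeholder:

```python
# Minimal DeepSeek wrapper via the OpenAI-compatible endpoint.
# base_url and model name are assumptions -- verify against current docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",   # placeholder
    base_url="https://api.deepseek.com",
)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="deepseek-coder",        # assumed model tag
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(ask("Write a Python function that merges two sorted lists."))
```

For actual editor integration, extensions like Continue can reportedly point at any OpenAI-compatible endpoint, so the same base URL should work there too.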
Let's get the jailbreak video going... 🎉🎉🎉
Qwen and DeepSeek Coder are both Chinese. It's amazing how they're slowly dominating everything.
Everything?
My GPT can even do real IMO math problems from just an image upload of the problem as well.
I'm ready to start using Rumble. I made an account and followed you.
This can do pretty crazy coding as well.
Anyone want 10x vision capabilities for ChatGPT? I figured out something OpenAI forgot about. It's called Smart Vision image/text analysis (paste your own custom instructions for a superior smart chat).
It is 10x better at analysing images, and especially at reasoning about relationships; read the examples below.
I uploaded an image of a cloud that looks like multiple things and can be interpreted different ways. The one I gave it was a personal photo I took, and it recognised it as a rabbit, which not even a random human guessed; it gets it on the first shot every time now. So it knows when something is unusual about an image even if you don't say anything. It can also do IQ-test image-reasoning pattern questions relatively well.
Another example: upload an image of a model wearing clothes that fit or don't fit, or that you're not sure about, and it can analyse that and tell you in great detail why they fit or don't fit.
It even sort of understands real logic games when given good instructions, but there it's limited by the model, not the instructions.
*IMPORTANT* You just have to follow the instructions given to get the right seed; it's about a 1-in-2 chance, and I have absolutely no idea why it needs that. Just paste the conversation starter and you'll understand what to do.
I don't understand what you wrote. What did you create? How is it used?
@@BienestarMutuo It's pretty complicated, but basically it's like 10x smarter than GPT-4o, especially for analysing any uploaded images. It's just better accuracy overall; you can use it for anything, and give it your own custom prompts as well.
@@xd-qi6ry How can I test that?
@@BienestarMutuo All you need to do is search for Smart Vision in GPT or on the website, in the custom GPT section, and it should show up with that orange eye image. It has 50 people using it, but it can be used for pretty much any use case; it's actually GPT-6 level.
Just search 'Smart Vision image/text analysis' in the GPT store; it's an orange eye-looking logo.
I would imagine that most of these datasets are compromised in some way or another. I know for a fact that parts of HumanEvalJava are included in large training data sources like The Stack. IMO only new datasets or real usage can accurately evaluate these models. If a model comes out of nowhere and consistently outperforms by that much, that's a red flag for me.
Yep, this was something I was highly suspicious of with DeepSeek Coder V2. However, I do think there are practical limits to how fast these benchmarks can be fooled / reverse engineered.
@@aifluxchannel They are not "fooled" per se; it's (I'm totally guessing) simply that the models could be overfitting. An LLM can start overfitting when the problems and solutions for the benchmarks have been included in the large datasets used for training it. Normally you HAVE to separate training and evaluation data, but there are practical limitations when dealing with LLMs, which need vast data sources that continuously scrape huge amounts of the internet.
The worst case, in a scientific-ethics context, is if they intentionally selected older benchmarks for their results because of better relative performance, since older benchmarks are more likely to suffer from data leakage; but that is a more serious accusation.
I will try it myself nonetheless; the code quality of your example was really good, at least compared to GPT :)
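The standard decontamination heuristic for the leakage concern above is an n-gram overlap check between benchmark items and training documents (several model reports use 13-grams). A toy sketch of the idea; real pipelines hash shingles at corpus scale rather than comparing documents pairwise:

```python
# Toy contamination check: flag benchmark problems sharing long n-grams
# with training documents. Illustrative only; not a production dedup pipeline.
def ngrams(text: str, n: int = 13) -> set:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(problem: str, training_docs, n: int = 13) -> bool:
    probe = ngrams(problem, n)
    return any(probe & ngrams(doc, n) for doc in training_docs)
```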