Qwen Just Casually Started the Local AI Revolution

  • Published Jan 14, 2025

COMMENTS • 334

  • @ColeMedin
    @ColeMedin  2 місяці тому +27

    Since I test Qwen-2.5-Coder-32b with oTToDev in this video, I wanted to say: Come join our Discourse community for oTToDev, our fork of Bolt.new!
    thinktank.ottomator.ai
    Also, I'm working on more super big things behind the scenes:
    ottomator.ai

    • @GiomPanot
      @GiomPanot Місяць тому

      @@ColeMedin Cool, let me join the community. I can bring business and product manager experience, but not coding ;).

    • @DrCognitive
      @DrCognitive Місяць тому

      For the video card, I notice there are 12GB and 24GB versions. I'm assuming you need the 24GB version?

    • @QuizmasterLaw
      @QuizmasterLaw Місяць тому

      I found Qwen to be sucky.

  • @MilesBellas
    @MilesBellas 2 місяці тому +105

    Uncensored and offline = the best

    • @ColeMedin
      @ColeMedin  2 місяці тому +6

      Agreed!

    • @cirtey29
      @cirtey29 Місяць тому +13

      @@tom-moroney Less censored than US-made LLMs. You can very easily jailbreak it. It's harder to jailbreak Claude or GPT.

    • @fredmercury1314
      @fredmercury1314 Місяць тому +13

      @@tom-moroney Fortunately I've never needed a local coding assistant to know about Tiananmen Square.

    • @DudeSoWin
      @DudeSoWin Місяць тому +1

      @@tom-moroney The Redactions & Retconning of a French Apotheosis that won't get out from up my ass? Nope none of that in the Chinese version!

    • @Yomi4D
      @Yomi4D Місяць тому +1

      ​@@fredmercury1314😂😂

  • @timothymaggenti717
    @timothymaggenti717 2 місяці тому +7

    You're an amazing young man, keep up the good work. I value your videos; I keep going back to them like a valuable library, which they are. Thanks for the high value you bring to each video.

  • @jaredcluff5105
    @jaredcluff5105 2 місяці тому +18

    Best vid showing what Qwen 2.5 32b can do. I have it on my mac studio after watching your video and am looking very much forward to putting it in my pipelines.

    • @ColeMedin
      @ColeMedin  2 місяці тому

      @@jaredcluff5105 Thanks Jared! Sounds great, I hope it works great for you too!

  • @pmarreck
    @pmarreck 2 місяці тому +22

    This model arrived with perfect timing after I got my M4 Max w/ 128GB RAM. It works fantastically. I am so excited about what's next... and about being able to run it all on this machine.

    • @harshamv5113
      @harshamv5113 Місяць тому

      How heavy is the mac? Looking to buy one

    • @nicholasthon973
      @nicholasthon973 Місяць тому +1

      @@harshamv5113 Heavy? It's roughly a hockey puck.

    • @AutoCannonSaysHi
      @AutoCannonSaysHi Місяць тому

      @@nicholasthon973 You Canadian?

    • @justindressler5992
      @justindressler5992 Місяць тому +2

      Hey mate, how fast is it running Coder 32b? How many tokens per second?

    • @newskybox
      @newskybox 20 днів тому +1

      128gb of ram on a Mac?? Are you alright????????

  • @patruff
    @patruff 2 місяці тому +8

    Seen some of your stuff before but just liked and subbed. Love the no fluff, no hype, actually useful approach. Thanks for not giving us a Snake app in Python and then just emoting about how amazing AI is.

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      Thanks so much - I appreciate it a lot! :D

  • @vannoo67
    @vannoo67 Місяць тому +8

    I managed to get Qwen2.5-coder:32b running on my RTX 4070 Ti Super (16GB VRAM) on Windows 10. Note: it had to use shared GPU memory, so it ran pretty slowly, but it was able (for me, for the first time) to work effectively with oTToDev. (Actually getting results in the preview window.) Woot! I also tried Qwen2.5-coder:14b, but it failed about as badly as Qwen2.5-coder:7b and Qwen2.5-coder-instruct:7b.

  • @hrmanager6883
    @hrmanager6883 2 місяці тому +6

    Best one on Qwen 2.5 Coder 32B, thanks for the great sharing, wishing you the best in the future.

    • @ColeMedin
      @ColeMedin  2 місяці тому

      Thank you, you too!!

  • @hernanangular8124
    @hernanangular8124 Місяць тому +10

    🎯 Key points for quick navigation:
    00:00 *🚀 A breakthrough in AI with Qwen 2.5 coder 32b*
    - The release of Qwen 2.5 coder 32b marks a significant advancement in AI, particularly for open-source models.
    - Excitement surrounding the new model and its capabilities.
    - The speaker emphasizes local operation and cost-effectiveness of running the model.
    02:00 *🖥️ Hardware requirements for running Qwen 2.5 coder 32b*
    - Detailed explanation of how to set up local large language models using the Ollama platform (see the sketch after this comment).
    - Recommendations for hardware, including the necessity of powerful GPUs for optimal performance.
    - Options for different model sizes to accommodate varying hardware capabilities.
    04:00 *🛠️ Testing Qwen 2.5 coder 32b with coding tasks*
    - The speaker tests Qwen 2.5 coder 32b by creating a React chat interface, showcasing the model's effective handling of coding tasks.
    - Comparison with other local models that struggle with complex prompts.
    - Highlighting successes in iterative testing of the chat interface development.
    07:00 *⚙️ Performance comparison with Code Llama 34b*
    - A side-by-side evaluation shows Qwen 2.5 outperforming Code Llama in executing commands and handling input.
    - Revealing common weaknesses of other local models in handling tasks effectively.
    - The demonstration solidifies Qwen 2.5 coder 32b's superiority in practical applications.
    09:00 *☁️ Introduction to Novita AI platform*
    - Overview of Novita AI, a platform that simplifies the use of open-source language models.
    - Features include a flexible API and configurable GPU instances, enhancing deployment and application for AI projects.
    - The speaker shares personal experiences with Novita AI and its efficacy in managing AI infrastructures.
    11:00 *🔗 Developing an AI agent with Qwen 2.5 coder 32b*
    - Description of how an AI agent is constructed using LangChain and LangGraph to test new models.
    - Testing the agent's ability to interact with various tools like task management systems and Google Drive.
    - Emphasis on the agent's performance and adaptability compared to other models previously tested.
    15:00 *💡 Conclusion on the future of local AI models*
    - The speaker reflects on the exciting advancements in local open-source AI and how Qwen 2.5 coder 32b stands out.
    - Urges viewers to recognize the diminishing gaps between closed-source models and robust local alternatives.
    - Encouragement for audience engagement and anticipation for further developments in the AI landscape.
    Made with HARPA AI
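For anyone following along with the Ollama setup mentioned at 02:00 in the summary above, here is a minimal sketch of calling a locally pulled Qwen 2.5 Coder model through Ollama's HTTP API; the model tag and prompt are just examples and assume `ollama pull qwen2.5-coder:32b` has already been run.

```python
# Minimal sketch: chat with a locally pulled Qwen 2.5 Coder model
# through Ollama's HTTP API (default port 11434, non-streaming).
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5-coder:32b",
        "messages": [
            {"role": "user", "content": "Write a Python function that reverses a string."}
        ],
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```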

  • @PhuPhillipTrinh
    @PhuPhillipTrinh 2 місяці тому +13

    Hey Cole, great video as always! Curious, for Mac machines with the new M4 chips, what would be the minimum requirement for the 32B model, let's say? The Q (quantized) models are an interesting touch, that's for sure!

    • @ColeMedin
      @ColeMedin  2 місяці тому +4

      Thank you! For the new Mac machines, for Qwen-2.5-Coder-32b I'd recommend the M4 Max chip and at least 64 GB of unified memory.

    • @PhuPhillipTrinh
      @PhuPhillipTrinh 2 місяці тому +3

      @@ColeMedin damn you better write some legit code to get back $5000 CAD worth of tokens lol. Ah well good thing there's pay per use for us peasants :D

  • @JimWellsIsGreat
    @JimWellsIsGreat 2 місяці тому +11

    Interesting. I saw a couple of other channels using Qwen 2.5 32B and it failed at some simple-to-moderate code.

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      @@JimWellsIsGreat Wow that's interesting, could you call out a specific video or two? I've only seen good results from others so I'd love to check it out!

    • @figs3284
      @figs3284 2 місяці тому +11

      @@ColeMedin The AICodeKing video; it seemed to fail in Aider and Cline. I'm not sure why, but I imagine it has something to do with the way they implement tools, file edits, etc. When I used it with Cline it opened the file and started writing it, and then somewhere in the middle it just kept repeating text over and over. Looks like it's working pretty well with oTToDev though!

    • @Jblcc
      @Jblcc 2 місяці тому +2

      Yeah, same here, I saw the video and it was really disappointing to see. But your review turned out good.

    • @mikekidder
      @mikekidder Місяць тому

      Working pretty well on Ollama (Open WebUI) and LM Studio.

  • @ChrisMartinPrivateAI
    @ChrisMartinPrivateAI 2 місяці тому +2

    Great counter-FOMO operations video, Cole ->> not letting the latest super cool and sexy proprietary closed-source AI feature or company make us think we're missing out by going local. I am curious how much context control you have with Qwen: this file, this module, the entire code base. If "context is King" is true, then we creators need help adding the right amount of context for the task while being cognizant of the GPU resources available.

    • @andrepaes3908
      @andrepaes3908 2 місяці тому

      This model supports a maximum of 32k context in direct mode using Ollama and 128k in indirect mode using vLLM.

  • @lovol2
    @lovol2 Місяць тому +2

    Fantastic. You've got pre-prepped, quality prompts that are realistic, not just 'make a Snake game'. Subscribed.

  • @augmentos
    @augmentos Місяць тому +3

    Thank you, really good review. I would mention that although you reference Claude 3.5 at one point, you never draw direct comparisons except with two other local models. It would always help to express the delta between any cloud-based service and the local model, especially when hyping the local model. It makes it easier to decide whether it's worth the time expenditure.

    • @ColeMedin
      @ColeMedin  Місяць тому +2

      Yes very valid point, I will consider that for another video showcasing local LLMs with oTToDev!

    • @augmentos
      @augmentos Місяць тому +1

      @@ColeMedin You're awesome loving these efforts Cole

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      Thank you, you bet!!

  • @DaveFriedel
    @DaveFriedel Місяць тому +2

    100% agree with Qwen. It’s outstanding

  • @WhyitHappens-911
    @WhyitHappens-911 2 місяці тому +3

    Good job with the eval agent!
    Do you plan to bring in an additional layer of agents, managed by an orchestrator?
    Bringing the swarm concept into LangGraph would be awesome.

    • @ColeMedin
      @ColeMedin  Місяць тому

      Thank you! Yes certainly!!

  • @lancemarchetti8673
    @lancemarchetti8673 2 місяці тому +66

    I could swear you're Matthew Berman's son. :D

  • @provod_
    @provod_ Місяць тому +1

    It still failed a very basic test of "Write an OpenGL ES context init sequence for Nvidia Jetson Nano without using X11", in a very unsophisticated fashion, similar to 1-2yr old smaller open source models. I've yet to find a model that can do this, even with heavy human guidance. And I tried dozens of models, including state of the art commercial ones.

  • @davidtindell950
    @davidtindell950 2 місяці тому +3

    New subscriber! I have been testing this new Qwen 2.5 32b Coder LLM under Ollama for nearly 72 hours and it runs well on my old Dell G7!

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      That's fantastic!

  • @generated.moment
    @generated.moment 20 днів тому +1

    At this point, with lots of new stuff to learn and AI's fast pace of progress, I need a chip that helps me digest everything out there.

  • @shay_groups
    @shay_groups 7 днів тому +1

    Can you please share the prompt that you used at minute 05:10?

  • @michaelzumpano7318
    @michaelzumpano7318 Місяць тому +1

    Excellent video! You answered all my questions in the order I thought of them. Subscribed!

    • @ColeMedin
      @ColeMedin  Місяць тому

      Thank you very much! :D

  • @6tokyt468
    @6tokyt468 2 місяці тому +8

    It would be interesting if those who say they use it describe the hardware used.
    I'll start: Qwen 2.5-Coder 32B with Ollama (context 32K) on a very high-end PC: i9 14400K at 6 GHz with 32 threads, 128GB DDR5, RTX 4090 24GB, 4TB NVMe.
    It runs but is relatively slow.
    It's up to you!
    [Edit]: Since this comment, I'm now at ~37 tokens/s. I'll share my settings with you if nobody (including me) can do better.
    Quick question in passing, for you, an average of 37 tokens per second, is that very slow, slow, normal, fast, very fast!?
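For anyone wondering how numbers like the ~37 tokens/s above are measured: Ollama reports token counts and timings on non-streamed responses, so a rough sketch (assuming the 32b model is already pulled) could look like this.

```python
# Rough sketch: compute generation speed (tokens/s) from the timing
# fields Ollama returns on a non-streamed /api/generate call.
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5-coder:32b",
          "prompt": "Write a quicksort in Python.",
          "stream": False},
    timeout=600,
).json()

# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
tokens_per_second = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"{tokens_per_second:.1f} tokens/s")
```

For reference, the figures reported elsewhere in this thread for 3090/4090-class cards fall roughly in the 19-38 tokens/s range.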

    • @andrepaes3908
      @andrepaes3908 2 місяці тому +7

      I am running the same Qwen 32b at 4-bit quantization with a 32k context size on an Acer Swift Go 14 laptop (Intel Core Ultra 7 155H CPU, 32GB RAM, 1TB SSD) and a Zotac 3090 with 24GB VRAM mounted in an eGPU connected to the laptop through TB4. It runs this model at 25 tokens/s on the first interaction, diminishing down to 22 tokens/s with follow-up interactions.

    • @ColeMedin
      @ColeMedin  Місяць тому

      Thanks for sharing! Yeah that's good!

    • @joelv4495
      @joelv4495 Місяць тому +2

      I got 38 tokens/s running Ollama prompts directly from the console on my 7900X/64GB/4090 gaming rig, so that seems about right. I have the port exposed to my MacBook, and running the server across the LAN seems significantly slower though (not sure how to benchmark it, but I'd guess 2~3 tokens/s).

  • @FlintStone-c3s
    @FlintStone-c3s 2 місяці тому +2

    Will be interesting to see how well it runs on my Pi5 8GB. The last few LLM models have been pretty good.

    • @とふこ
      @とふこ Місяць тому

      They run fine even on an Android phone with KoboldAI in Termux. Just make sure to use a GGUF that is optimized for ARM inference.
      On a strong enough phone with an NPU, 12b models are possible.

  • @arnarnitistic
    @arnarnitistic Місяць тому +1

    Maybe this has been asked already: if I don't have the capacity to get specs like yours, where can I host the model the cheapest? (Aside from local :) )

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      Great question! I would suggest RunPod or Novita!

  • @jokelot5221
    @jokelot5221 28 днів тому +1

    What model would you recommend to run on an SBC like a Raspberry Pi 5? And can these smaller models, like 3b, be fine-tuned to do agent tasks like switching GPIO pins on/off using pre-prepared Python scripts more efficiently than, say, Llama models? I'm interested in this because I want to build a device using a local LLM and an SBC like the Pi. There are also more powerful options like the Nvidia Jetson Nano that I would consider; can that device run models with more parameters efficiently? The best I managed is to run Llama 3b on the Pi: I created a small agent that turns an LED attached to one of the GPIO pins on/off on voice commands through the Whisper speech-to-text model and then responds through the Piper text-to-speech model. All of these models, including the Llama 3.2:3b model, run on my Pi, but it still takes a while before they respond. I managed to do some optimizations to speed them up, but I was wondering whether a Jetson would help me upgrade this in speed and performance, and whether using other LLMs like Qwen could make a difference.

    • @ColeMedin
      @ColeMedin  19 днів тому

      Yeah on a Raspberry Pi Llama 3.2 3b would be your best bet. With the Jetson you can run up to 8b parameter models though! So something like Qwen 2.5 7b would be a good bet.
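A hypothetical sketch of the kind of setup described above (a small local model deciding whether to toggle an LED over GPIO); the pin number, model tag, and prompt are illustrative assumptions, not details from the video.

```python
# Hypothetical sketch: a small local model (e.g. llama3.2:3b pulled in
# Ollama) maps a transcribed voice command to a GPIO action. The pin
# number and model name are illustrative only.
import requests
from gpiozero import LED

led = LED(17)  # assumed wiring: LED on GPIO 17

def classify(command: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2:3b",
            "prompt": f"Answer with exactly ON, OFF, or NONE. Command: {command}",
            "stream": False,
        },
        timeout=120,
    ).json()
    return r["response"].strip().upper()

action = classify("please turn the light on")
if action.startswith("ON"):
    led.on()
elif action.startswith("OFF"):
    led.off()
```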

    • @amoledzeppelin
      @amoledzeppelin 6 днів тому

      What about Qwen2.5-Coder-1.5B (q6)?

  • @philamavikane9423
    @philamavikane9423 2 місяці тому +2

    Been waiting for you to make this one and what it means for oTToDev. This makes things a lot more interesting
    Also saw you have a custom Qwen-2.5-Coder-32b for oTToDev.. what changes are you playing with?

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      It sure does!
      I haven't been playing with anything big yet, the main thing was increasing the context length since Ollama by default only has 2k tokens for the context length for any model. That isn't necessary for oTToDev anymore actually since we handle that config in code, but I needed that for my other agent testing.

  • @Techonsapevole
    @Techonsapevole 2 місяці тому +1

    Impressive. Local AI maybe won't be on par with online models, but it's already pretty useful.

  • @DiyEcoProjects
    @DiyEcoProjects Місяць тому +3

    ❤Hi Cole, just wanted you to know that your videos are appreciated; I've subbed today but I've looked at like 50 videos lol.
    Top quality, thank you bro! Also, if it's OK, can I ask something please?
    I'm new to this space... I'm trying to do something with RAG. What's the best LOCAL LLM+RAG you've come across that doesn't sound like shit in its responses? I have a laptop with 32GB RAM and a better-than-medium GPU, but not top of the range (Asus GL703VM, Intel i7-7700, and GTX 1060).
    Would appreciate some guidance please. Thank you, all the best

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      Thank you so much!!
      With your computer I would try Qwen-2.5 14b instruct. It's still a smaller model so the performance might not be the best, but the whole family of Qwen-2.5 models is super impressive!
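A bare-bones sketch of the local RAG idea discussed here, assuming Ollama is running with an embedding model and a Qwen 2.5 14b variant pulled; the model tags are examples only, not what the video prescribes.

```python
# Bare-bones local RAG sketch: embed a few documents with a local
# embedding model, retrieve the closest one by cosine similarity, and
# let a local Qwen model answer from it. Model tags are examples; pull
# them first (e.g. ollama pull nomic-embed-text / qwen2.5:14b).
import requests
import numpy as np

OLLAMA = "http://localhost:11434"

def embed(text: str) -> np.ndarray:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text}).json()
    return np.array(r["embedding"])

docs = ["Our office opens at 9am.", "The wifi password is rotated monthly."]
doc_vecs = [embed(d) for d in docs]

question = "When does the office open?"
q = embed(question)
best = max(range(len(docs)),
           key=lambda i: float(q @ doc_vecs[i]) /
                         (np.linalg.norm(q) * np.linalg.norm(doc_vecs[i])))

answer = requests.post(f"{OLLAMA}/api/generate", json={
    "model": "qwen2.5:14b",
    "prompt": f"Answer using only this context:\n{docs[best]}\n\nQuestion: {question}",
    "stream": False,
}).json()["response"]
print(answer)
```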

  • @GiomPanot
    @GiomPanot 2 місяці тому +2

    Thanks Cole! I am using Bolt.new with the paid version and it is quite a nightmare to prompt and prompt and prompt a lot; the UI breaks a lot. Do you recommend switching to Qwen and oTToDev, or waiting a bit? I am not a dev but can build some stuff.

    • @ColeMedin
      @ColeMedin  Місяць тому +2

      You bet! Yeah I'd give it a shot! Or at least switch to using Claude 3.5 Sonnet with oTToDev, since you have to use your own API key but at least you won't hit rate limits.

  • @Human_Evolution-
    @Human_Evolution- Місяць тому +1

    Do local LLMs have internet access for their replies? I tried a random LLM today that couldn't access the web, so it didn't know what I was asking.

    • @ColeMedin
      @ColeMedin  Місяць тому +2

      They don't by default, so you have to give them a tool to search the web as an agent. I actually do this in the video I am posting this Sunday!
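One simple way to wire up the web-search tool mentioned above is to run the search outside the model and inject the results into the prompt. This sketch assumes the duckduckgo_search package as the search backend, which is an assumption and not necessarily what the video uses.

```python
# Sketch: give a local LLM web access by running the search yourself
# and injecting the results into the prompt.
import requests
from duckduckgo_search import DDGS

def web_answer(question: str) -> str:
    hits = DDGS().text(question, max_results=3)
    context = "\n".join(f"- {h['title']}: {h['body']}" for h in hits)
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": "qwen2.5-coder:32b",
        "prompt": f"Using these search results:\n{context}\n\nAnswer: {question}",
        "stream": False,
    }, timeout=300)
    return r.json()["response"]

print(web_answer("What is the context length of Qwen 2.5 Coder?"))
```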

    • @clutchboi4038
      @clutchboi4038 Місяць тому

      ​@@ColeMedinsubbed

  • @fsaudm
    @fsaudm 2 місяці тому +2

    Nice!! Though, how about vLLM instead of Ollama?

    • @ColeMedin
      @ColeMedin  2 місяці тому

      Great suggestion! We don't support that now but that would be a good addition!
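If someone does stand Qwen up behind vLLM, its OpenAI-compatible server means any OpenAI client can talk to it. A sketch, assuming something like `vllm serve Qwen/Qwen2.5-Coder-32B-Instruct` is already running on localhost:8000:

```python
# Sketch: talking to a vLLM instance serving Qwen through its
# OpenAI-compatible endpoint (assumed to be on localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)
```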

  • @jasonshere
    @jasonshere Місяць тому

    This looks great. How does it compare to something like Windsurf? They use Claude 3.5, etc., but of course there are more restrictions on how it can be used.

    • @ColeMedin
      @ColeMedin  Місяць тому

      Thanks! I prefer oTToDev/Bolt.new for building frontends and Windsurf for everything else!

  • @simplepuppy
    @simplepuppy 2 місяці тому +1

    Your fork is awesome! I wish it could help with deployment, which is one of the biggest pain points.

    • @ColeMedin
      @ColeMedin  2 місяці тому

      Thank you! And that is one of the features we are hoping to implement really soon!

  • @lakergreat1
    @lakergreat1 2 місяці тому +1

    further to your commentary, I would love to see a spec build of your rig on pcpartpicker or something

    • @ColeMedin
      @ColeMedin  2 місяці тому

      I am planning on sharing that soon!

  • @vladimirrumyantsev7445
    @vladimirrumyantsev7445 Місяць тому +1

    Hello, thank you for such valuable content 👍

  • @shotybumbati
    @shotybumbati Місяць тому +1

    I just tested Qwen 14b on a 6800XT and it was so fast it blew me away.

  • @anianait
    @anianait Місяць тому +2

    Do Bolt... and oTToDev support GGUF models?

    • @ColeMedin
      @ColeMedin  Місяць тому

      Yes because you can set them up in Ollama!

  • @rastinder
    @rastinder 2 місяці тому +3

    Is it possible to edit existing project in your bolt.new ?

    • @ColeMedin
      @ColeMedin  2 місяці тому +2

      Not yet but that is something we are looking to have implemented very soon for the project!

  • @diszydreams
    @diszydreams Місяць тому +1

    Thanks for the vid! It was really good.

  • @andrepaes3908
    @andrepaes3908 2 місяці тому +1

    Great content! Keep it going!

  • @jesusjim
    @jesusjim 2 місяці тому +5

    Did you change the context window for the model or did you just use the standard 2k in oTToDev?

    • @jesusjim
      @jesusjim 2 місяці тому +2

      I meant this: PARAMETER num_ctx 32768?

    • @ColeMedin
      @ColeMedin  2 місяці тому +3

      @@jesusjim Great question! I did change the context window to make it bigger. Not necessary for oTToDev anymore since we included this parameter within the code to instantiate Ollama, but it was necessary for me to do for my custom coded agent.
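For readers who would rather not bake PARAMETER num_ctx into a Modelfile, the same override can be passed per request through Ollama's options field, which mirrors what the reply above describes doing in code. A minimal sketch:

```python
# Sketch: request a larger context window per call via Ollama's
# options field instead of a Modelfile PARAMETER line.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5-coder:32b",
        "messages": [{"role": "user", "content": "Summarize this repo layout..."}],
        "options": {"num_ctx": 32768},  # override Ollama's 2k default
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```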

  • @MediaMentorAI
    @MediaMentorAI 2 місяці тому +1

    Love your content! Question for you: I don't have access to a home AI server and am using a cloud GPU through RunPod. I can upload the full 32b Qwen model there; how can I link it to your bolt.new fork?

    • @MediaMentorAI
      @MediaMentorAI 2 місяці тому +1

      @ColeMedin Actually, I just signed up for the Discourse; I'll get my answer there :)

  • @amit4rou
    @amit4rou Місяць тому +1

    Hey! Can you say whether Qwen2.5 32b would work with my setup: i9-13980HX, 32GB DDR5 5600MHz RAM + RTX 4060 8GB VRAM?
    Very excited to hear this; I've probably been searching for something like this.

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      Great question! Unfortunately 8GB of VRAM probably isn't enough to run a 32b parameter model. You could try running Qwen 2.5 Coder 14b though!
      Otherwise you can access the 32b parameter version through OpenRouter or HuggingFace if you aren't able to run it on your machine.

    • @MauroC74
      @MauroC74 Місяць тому +1

      use LM Studio, it will tell you if you can run the model

  • @YoungTeeke
    @YoungTeeke 2 місяці тому +2

    Yo! I don't have a GPU capable of running this, but I do know you can rent GPUs for a few hours or days or whatever. I think there could be a Linux distro that we could upload to a rented GPU with all the configuration to run Qwen 32, deploy quickly to utilize the GPU, and then export the code out of the machine. Maybe a VirtualBox image or VM that's ready to go off the rip? I'm still struggling with how to get my local LLM set up and running because I ate paint chips when I was little.

    • @YoungTeeke
      @YoungTeeke 2 місяці тому +1

      Maybe even an email preconfigured with a Proton Mail account or something, so people could email out the code they generate on rented GPUs or servers? This would be really, really cool and helpful, and it's common in the crypto world, especially for the new GPU/CPU hybrid chains for mining.

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      I love your thoughts here! So basically you are thinking of a way to have a machine in the cloud that isn't accessible to everyone directly, but people would have a way to send requests into the LLM with something like email to use Qwen-2.5-Coder-32b for larger coding tasks?
      You could definitely do this even with just a GPU instance running Qwen-2.5-Coder-32b behind an API endpoint that you could create a frontend around! Or just host oTToDev itself for people to use which is something I am looking into doing!

    • @YoungTeeke
      @YoungTeeke 2 місяці тому +1

      @ Yeah, a lot of people are renting clusters in the Xen and X1 community, which is GPU intensive. I think they were renting out rigs on... I forget, but basically make it so someone could download a VM that's preconfigured to run oTToDev, save the virtual machine, rent a server (or in your case your PC) out for isolated GPU access, and run that VM on your rented GPU server. Then email or Dropbox the files you create back to your regular system before your GPU rental rig expires.

  • @NotSure416
    @NotSure416 Місяць тому +2

    Fantastic! I built a decent 3090 machine. Time to put it to use!

  • @PyJu80
    @PyJu80 2 місяці тому +1

    Sorry... me again. 😂
    So... my brain was thinking about GPUs and resources with LLMs. What I was thinking is: if, say, Qwen had 10 languages, could you essentially split that model into ten and give each language to an agent to specialise in? Like an advanced swarm of agents. So when you run a prompt, each agent is specialised in one language, but they come together to dev the code, so each model only uses the GPU for its part of the code. Maybe use a large LLM just for structure, then have the agents produce the code one at a time.
    I'm really about quality over speed for my agents. I'd happily prompt in the morning and have it all singing and dancing by the night. Thoughts?

    • @ColeMedin
      @ColeMedin  2 місяці тому +2

      Yes you could certainly do that, I love your thoughts here! You would just have to have a router agent that takes in each request and determines which agent to send the request to based on the language.
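A hypothetical sketch of that router pattern: one cheap model call classifies the request's language, then the prompt is dispatched to a per-language specialist. The model names and the language list are illustrative only.

```python
# Hypothetical router-agent sketch: classify the request's language
# with a small model, then dispatch to a per-language specialist.
import requests

SPECIALISTS = {
    "python": "qwen2.5-coder:32b",
    "javascript": "qwen2.5-coder:32b",
    "other": "qwen2.5-coder:32b",
}

def ask(model: str, prompt: str) -> str:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False},
                      timeout=600)
    return r.json()["response"]

def route(request: str) -> str:
    label = ask("qwen2.5-coder:7b",
                "Reply with one word (python, javascript, or other) for the "
                f"main language of this request: {request}").strip().lower()
    specialist = SPECIALISTS.get(label, SPECIALISTS["other"])
    return ask(specialist, request)

print(route("Write a debounce helper for a React input."))
```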

    • @autohmae
      @autohmae Місяць тому

      This would need fine-tuning; it would take a bunch of GPU power, but it would only be needed once.

  • @voxstar1067
    @voxstar1067 2 місяці тому +2

    Hey Cole, can you connect it to your existing repo and ask it to define new functions? Can it build IOS apps?

    • @ColeMedin
      @ColeMedin  Місяць тому

      Not yet but these are features we want to implement soon for sure!

  • @ArmadusMalaysia
    @ArmadusMalaysia Місяць тому

    How does it do with translations, and with uploading audio for transcription?

    • @ColeMedin
      @ColeMedin  Місяць тому

      Translations it does well! For uploading of audio for transcriptions, I haven't tried this but I'm not sure it supports taking in audio directly.

  • @TheEntrepwneur-zn1wh
    @TheEntrepwneur-zn1wh Місяць тому +1

    Would a 4080 work in place of a 3090?

    • @ColeMedin
      @ColeMedin  28 днів тому +1

      Good question! A 3090 is a good amount better than a 4080 specifically for LLM inference because it has 24GB of VRAM instead of 16GB so you can fit larger models on the card.

    • @TheEntrepwneur-zn1wh
      @TheEntrepwneur-zn1wh 19 днів тому

      @@ColeMedin Got it thanks man

  • @gamez1237
    @gamez1237 2 місяці тому +5

    Should I be fine with a 3060 12GB and 64GB of RAM? I am able to run Codestral just fine.

    • @zikwin
      @zikwin 2 місяці тому +2

      limited to 14B like mine

    • @gamez1237
      @gamez1237 2 місяці тому

      @ i’m able to run 22b models just fine

    • @AaronBlox-h2t
      @AaronBlox-h2t 2 місяці тому

      Reallly....gotta go see what that was all about

    • @ColeMedin
      @ColeMedin  2 місяці тому

      @@gamez1237 I would try for sure, but it might be pushing it a bit! You could always try the Q_2 version as well that I referenced in the video.

  • @GuilhermeHKohnGAnti
    @GuilhermeHKohnGAnti 2 місяці тому +3

    Can we use Qwen 2.5 32B hosted in a Hugging Face Space in oTToDev? It would be awesome since not everyone has a GPU that supports 32B-parameter models.

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      @@GuilhermeHKohnGAnti Great question! HuggingFace isn't supported for oTToDev and I'm not sure if it would be a practical implementation like some of the other providers, but it would be good to look into! You can use Qwen 2.5 Coder 32b through OpenRouter though and it's super cheap, as cheap as GPT-4o-mini
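A sketch of calling Qwen 2.5 Coder 32b through OpenRouter's OpenAI-compatible API, as suggested above; the model slug is a best guess and should be checked against OpenRouter's model catalog.

```python
# Sketch: using Qwen 2.5 Coder 32b through OpenRouter instead of
# running it locally. The model slug is an assumption; verify it
# against OpenRouter's catalog before use.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen/qwen-2.5-coder-32b-instruct",
    messages=[{"role": "user", "content": "Refactor this loop into a list comprehension: ..."}],
)
print(resp.choices[0].message.content)
```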

    • @IunahYT
      @IunahYT Місяць тому

      Actually, it looks like you can: just wrap your preferred Hugging Face Space API inside an OpenAI-like API server (which you can get from an AI).

  • @thanomnoimoh9299
    @thanomnoimoh9299 Місяць тому

    Is it possible to install Qwen-2.5 on a server running in a VM with MS Windows, to assist in writing code or scripts in various languages such as SQL, PostgreSQL, and Python, and then use the generated scripts with programs like DBeaver to manage data from different databases on the server? However, this lets Qwen-2.5 learn every field in the tables within the databases on that server, and our office wonders about data privacy. Do you have a better way to do all the above, or is installing only Llama 3 (70b) enough for all of it?

    • @ColeMedin
      @ColeMedin  Місяць тому

      Yes you certainly can! And if you are running locally it doesn't matter what it learns, your data is still private! Running Llama 3 is just as private as running Qwen!

  • @DikoJelev
    @DikoJelev Місяць тому

    Lovely video. Thanks for sharing and I have two questions.
    1/ Can I create a simple script with an AI in AutoIt or any other scripting platform for automation?
    2/ And is there a possibility to make my own AI for selecting photos, based on what it has learned from my previous selections of other photos?

    • @ColeMedin
      @ColeMedin  Місяць тому

      Thank you!
      1. I'm not actually familiar with AutoIt, but my favorite platform for no/low-code AI agent creation is n8n! Also just good old-fashioned Python with LangChain.
      2. Yes you certainly could! I would use Llama 3.2 90b vision for that!

  • @ChaitanyaDewangan-rg1cm
    @ChaitanyaDewangan-rg1cm 11 днів тому +1

    LOVED IT ❤

  • @ShaunPrince
    @ShaunPrince Місяць тому +1

    Great review here. Exciting model. I hope this channel blows up! Tired of all the "how to use Python" videos.

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      Thank you so much - that means a lot!

  • @protzmafia
    @protzmafia Місяць тому

    Been watching this for a bit and decided to take it for a spin today with some free time. Any initial thoughts on getting things to run a bit faster? I'm running the 7b since my system is a little older (2070 Super and Ryzen 7 3700X). When trying to build some code in the UI running in Docker it's not TERRIBLE, but definitely nowhere near as fast as yours. Would that just be hardware limitations at this point?

    • @ColeMedin
      @ColeMedin  Місяць тому

      Are you sure Ollama is using your GPU? Does the usage go up for the GPU (you could check in the task manager) when the LLM is running?

  • @ContentVibeio
    @ContentVibeio 2 місяці тому +1

    This is awesome Cole 😂🎉

    • @ColeMedin
      @ColeMedin  2 місяці тому

      @@ContentVibeio Thank you 😁

  • @AlejandroGonzalez-vw1nh
    @AlejandroGonzalez-vw1nh Місяць тому +1

    Super intrigued whether this model would work with the Windows ARM machines like the Surface Pro! Has anyone tried it? I imagine that with all the Copilot stuff it would function quite well.

    • @ColeMedin
      @ColeMedin  Місяць тому

      I haven't tried this myself but I'd be curious too!

  • @SSmartyimages
    @SSmartyimages Місяць тому

    Did you try windsurf?

  • @TomHimanen
    @TomHimanen 2 місяці тому +3

    Could you create a video about building different-level PC builds for running AI locally? For example, I dunno if 2x Nvidia RTX 3080s are better than 1x Nvidia RTX 4080. Also, it's difficult to understand the bottlenecks; is it enough to have a beefy GPU? I would like to invest all my money as effectively as possible, and this is relevant right now because Black Friday is coming. 😁

    • @ChrisTheDBA
      @ChrisTheDBA 2 місяці тому +1

      I would just like to know how you are getting the code to run on a specific GPU, if that's possible, or is it just the first GPU?

    • @ColeMedin
      @ColeMedin  2 місяці тому +3

      I really appreciate this suggestion! I'll certainly consider making this video before Black Friday, though I'd have to figure out how to fit it into my content calendar!
      Whether 2x of a slightly weaker card beats 1x of a slightly stronger one depends a lot on the models you want to run. If a model can fit into one GPU, it'll be much faster than having to split it between 2 GPUs. That all depends on the VRAM of the GPUs, which determines whether a model can fit.
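A rough rule of thumb for the "does it fit in VRAM" question, not a guarantee: weights take roughly parameter count times bytes per weight, plus a few GB of overhead for the KV cache and runtime buffers.

```python
# Back-of-the-envelope VRAM estimate (a rule of thumb, not a guarantee).
def fits_in_vram(params_billion: float, bytes_per_weight: float,
                 vram_gb: float, overhead_gb: float = 3.0) -> bool:
    weights_gb = params_billion * bytes_per_weight  # 1e9 params * bytes ~= GB
    return weights_gb + overhead_gb <= vram_gb

# Qwen 2.5 Coder 32b at 4-bit (~0.5 bytes/weight) on a single 24GB 3090:
print(fits_in_vram(32, 0.5, 24))   # ~16 GB of weights + overhead -> True
# The same model at 8-bit on the same card:
print(fits_in_vram(32, 1.0, 24))   # ~32 GB of weights -> False
```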

    • @TomHimanen
      @TomHimanen 2 місяці тому +1

      @ColeMedin Thanks for summing up the key differences of 1 and 2 GPU setups! Also great to hear that we might get good hints for making our Black Friday shopping lists. Thanks for the great content and for taking local AI into account!

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      You bet!!

  • @liuoinyliuoiny5905
    @liuoinyliuoiny5905 2 місяці тому +1

    This is awesome! What about OpenCoder?

    • @ColeMedin
      @ColeMedin  2 місяці тому

      Thanks! I'd love to try Opencoder too!

  • @VaibhavShewale
    @VaibhavShewale 2 місяці тому +2

    looks awesome

  • @modoulaminceesay9211
    @modoulaminceesay9211 2 місяці тому +1

    Can you try the 14b please

    • @ColeMedin
      @ColeMedin  2 місяці тому

      Yes I certainly will be diving into the 14b version as well!

  • @Singlton
    @Singlton Місяць тому

    If I want to build a chat app that saves history, how do I do it with Ollama?
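The question above goes unanswered in the thread, but the usual pattern with Ollama's chat endpoint is to keep the growing message list yourself and resend it each turn. A minimal sketch (the model tag is an example):

```python
# Minimal chat-with-history sketch: keep the message list yourself and
# send the whole history to Ollama's /api/chat on every turn.
import requests

history = []

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    r = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "qwen2.5-coder:32b", "messages": history, "stream": False},
        timeout=300,
    ).json()
    reply = r["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("Remember the number 42."))
print(chat("What number did I ask you to remember?"))
```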

  • @RyanStephenAlldridge
    @RyanStephenAlldridge 2 місяці тому

    I just can't seem to get a preview window going; is there a place for troubleshooting this particular version? I'm a noob, as you can see.

    • @ColeMedin
      @ColeMedin  Місяць тому

      Usually when the preview doesn't show it means the LLM hallucinated a bad command or bad code. I talk about it in my latest video I just posted!

  • @pramodgeorgehq
    @pramodgeorgehq 2 місяці тому +1

    Does it do dart language well?

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      I haven't tested it myself with Dart yet actually! One other person reported it not doing the best with Dart, but I'd try it yourself and see!

  • @TeemuKetola
    @TeemuKetola Місяць тому

    Will one 4090 with 128GB RAM be enough to run the Qwen 2.5 Coder 32B quantized version, or do you need a minimum of two 3090s?

  • @ivaldirbatalha5436
    @ivaldirbatalha5436 Місяць тому

    Cole, btw, what's your pc setup like to run these models locally?

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      I built a machine around two 3090s, that's what gives me the power to run all of these local models!

  • @benkolev4290
    @benkolev4290 2 місяці тому +2

    Nice, the project is getting better, but I "only" have the resources to run 14B models, sad.

    • @AaronBlox-h2t
      @AaronBlox-h2t 2 місяці тому +2

      What is your VRAM, RAM, and NVMe SSD? With custom code, you can split the load between VRAM, RAM and NVMe SSD. I'm doing that in Windows, but my NVMe is 7400MB/sec and DirectStorage enabled, so it is fast. I'm able to run Qwen2.5 32B bf16, which is not quantized. No Ollama, no LM Studio.

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      @@benkolev4290 14B is still great! And you can use OpenRouter if you want to access larger open source models through an API!

    • @OscarTheStrategist
      @OscarTheStrategist 2 місяці тому +1

      @@AaronBlox-h2t that’s super interesting! What are your tokens per second metrics and time to first token, if you don’t mind me asking?
      Thanks!

    • @igorshingelevich7627
      @igorshingelevich7627 2 місяці тому

      How are things in Lutsk?

    • @benkolev4290
      @benkolev4290 2 місяці тому

      @@igorshingelevich7627 OK + -

  • @ahmadmanga
    @ahmadmanga 2 місяці тому +1

    What are the minimum requirements to run this model at an acceptable performance? My PC isn't really powerful, and I'd like to know the minimum upgrade I need to run AI models.

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      I've heard some people running this model well even with a 1070 GPU! Personally I'd recommend a 3080/3090 for best performance.

  • @P-G-77
    @P-G-77 Місяць тому +1

    I thought downloads of this type would be much heavier, like 50-80GB+. Anyway... I see very nice results... AND locally... fantastic.

  • @aramirezp92
    @aramirezp92 Місяць тому

    What is the best OS to host a model on? Ubuntu?

    • @ColeMedin
      @ColeMedin  Місяць тому

      Yeah I would recommend Ubuntu!

  • @TomHermans
    @TomHermans 2 місяці тому +2

    I like your videos.
    Just a heads up.
    Vite is pronounced veet, like feet but with a v. It's french for Fast.

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      Ahh yes, thank you for pointing this out!

  • @esuus
    @esuus 2 місяці тому

    But at 18ct in/out for the API, you could still run it all the time agentically, right? Or are there any other reasons why you'd need to run it locally for the agentic stuff you're talking about?

    • @ColeMedin
      @ColeMedin  2 місяці тому

      Yeah that is true, though the cost still does add up! Running models locally also helps when you need private AI (your company has compliance requirements, you have intellectual property you want to keep safe, etc.) and you can fine-tune local LLMs on your own data!

  • @viorelteodorescu
    @viorelteodorescu 2 місяці тому

    Please mention hw requirements for Mac too...😎

    • @ColeMedin
      @ColeMedin  2 місяці тому

      To be honest I'm not an expert at Mac, but I'd say you'd want at least 64 GB of unified RAM with the new M4 chip.

  • @davronbekboltayev777
    @davronbekboltayev777 6 днів тому

    How much RAM does your computer have to run the 32b?

    • @ColeMedin
      @ColeMedin  5 днів тому

      I've got 128GB of RAM, though that isn't generally the bottleneck for running LLMs for inference. Typically the bottleneck is the GPU - and I run this model on a 3090 GPU.

  • @tomwawer5714
    @tomwawer5714 Місяць тому +1

    I was surprised I could run the 32b on my old 1070 Ti 6GB, yay!!! You don't exaggerate, it's a beast.

    • @brulsmurf
      @brulsmurf Місяць тому +1

      Some layers of the model are running on your GPU; most are running on your CPU.

    • @tomwawer5714
      @tomwawer5714 Місяць тому

      Sure thing!

    • @ColeMedin
      @ColeMedin  Місяць тому

      That's awesome!!

  • @hasstv9393
    @hasstv9393 2 місяці тому

    After entering the prompt, it just writes the code in the chat interface instead of the code editor. Why? I never got this working with Ollama.

    • @ColeMedin
      @ColeMedin  2 місяці тому

      Which model are you using? Really small models will do this because they can't handle the bolt prompting.

  • @MrI8igmac
    @MrI8igmac Місяць тому

    I'm trying to upload a folder full of Rails 8.0.0 tutorials. I extracted all the plain text from the HTML documents. I'm testing with Qwen Coder 32b, but I'm not getting the LLM to pass a coding task.

    • @ColeMedin
      @ColeMedin  Місяць тому

      Could you clarify what you mean? It's not performing as well as you want?

    • @MrI8igmac
      @MrI8igmac Місяць тому +1

      @ColeMedin I'm running Linux. My projects involve Ruby, gems, and Rails. If you run
      'find / -iname *.md'
      you can find all the instruction documents for any library.
      /usr/share/doc/ruby/README.md
      Gems/bootstrap/README.md
      So with a thousand README documents uploaded to Open WebUI, I can start conversations and build projects.
      My mistake was asking my chatbot to create a Bootstrap-styled website. Rails 8 is not documented enough for any chatbot to produce working instructions, so this was an advanced task that failed with all the LLMs I tested.

  • @LyuboslavPetrov
    @LyuboslavPetrov Місяць тому +1

    In my setup I do not get any code written in the editor - using Ollama with Qwen 1.5B, since the 7b model does not fit on my GPU. Anybody else seeing the same?

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      Yeah, unfortunately models that small tend to struggle with the Bolt prompting, so they often won't use the editor.

  • @lucascoble
    @lucascoble Місяць тому

    What does 1x 3090 performance look like?

    • @ColeMedin
      @ColeMedin  Місяць тому

      It'll be similar actually since Qwen-2.5-Coder-32b can fit into one 3090 (if you download the default one from Ollama which is Q4 quantized).

  • @SxZttm
    @SxZttm 2 місяці тому

    Hey. For some reason whenever I ask ollama to create me something it gives me code instead of using the implemented code editor. I did what you told us to do to fix it by changing the message length but it still didn’t work

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      Interesting... I'm guessing the model you are using is too small (3b/1b). Which model are you using?

  • @ariyako
    @ariyako Місяць тому +1

    Smooth run on a Mac Studio M1 Ultra 64GB with LM Studio + AnythingLLM.

  • @frosti7
    @frosti7 2 місяці тому

    11:57 can Monday be used instead of Asana?

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      Yeah certainly! They have a great API too!

  • @JJBoi8708
    @JJBoi8708 2 місяці тому

    Can you pls try the 14b? 🙏🏻

    • @ColeMedin
      @ColeMedin  2 місяці тому

      I am planning on doing this in the near future for more content!

  • @John-ek5bq
    @John-ek5bq 2 місяці тому

    I do not see Qwen 2.5 from OpenRouter in the bolt.new-any-llm local fork. To run it with Bolt, do I need to install Ollama and Qwen 2.5?

    • @ColeMedin
      @ColeMedin  Місяць тому

      Check again! There was an update to add it!

  • @sharplcdtv198
    @sharplcdtv198 2 місяці тому

    I have a very similar system with 2x 3090s and a Threadripper 3960X. My qwen2.5-coder:32b-q8_0_in32k doesn't do the things yours does... it fails miserably like your second tested model.
    Just tested now: qwen2.5-coder:32b-q8_0_in32k: 2.2 tokens/sec vs qwen2.5-coder:32b-base-q8_0: 19 tokens/sec at the Ollama prompt. So increasing the context window to 32k makes the inference way slower.

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      Yeah it makes sense because with oTToDev the prompt is actually bigger than the default context size of 2k tokens! Does it fail for you even with the increased context limit?

    • @richardl4936
      @richardl4936 2 місяці тому +2

      Mine seems to be working on a single 3090 with PARAMETER num_gpu 100
      PARAMETER num_ctx 15000
      PARAMETER num_batch 64
      PARAMETER num_thread 10
      SYSTEM OLLAMA_NUM_PARALLEL=1
      Anything more than that and it crashes. But the speed is great because it just fits in the vram. Even when using up all 15k of context.
      So if you need more than 15k context and you want full GPU, this is not the model for you.

  • @pensiveintrovert4318
    @pensiveintrovert4318 2 місяці тому +8

    I tested it with Cline, and no, it is not as capable as Claude 3.5 Sonnet. It has problems understanding and properly responding to Cline.

    • @ColeMedin
      @ColeMedin  2 місяці тому +10

      Thanks for testing that out! To be fair, with Cline it's probably prompted under the hood specifically to work with Claude models. I know from oTToDev and other experiences it definitely is beneficial to prompt different models in different ways.

    • @pensiveintrovert4318
      @pensiveintrovert4318 2 місяці тому +1

      @ColeMedin well, have to find the system prompt for Claude, to tweak Cline then. A totally free system would be great for experimentation.

    • @zanabalMuhamed
      @zanabalMuhamed 2 місяці тому

      How did you test it with Cline? It only accepts multimodal models, right?

    • @slavanastya
      @slavanastya 2 місяці тому +3

      I'm using Cline many hours each day, and my experience is that it works well with Claude, not bad with GPT family, but has difficulty with other models. Just the nature of the tool, if you look under the hood. I'm looking into forking it for experimenting more and getting results using different models

    • @pensiveintrovert4318
      @pensiveintrovert4318 Місяць тому

      A bit of an update. With a template and system prompt from the Ollama repo for qwen2.5-coder-32b-instruct Q8_0, and tweaking the context size to fit my hardware, the VSCode + Cline + Ollama + Qwen combo actually works decently now. It helped me figure out why one of the tools was not working by creating experimental bits of code on its own initiative; I asked a general question about why one folder had a problem and another didn't.

  • @codephyt
    @codephyt 2 місяці тому

    That failure also happens to me with the 7B model :( I have a higher-than-recommended system even for the 14b model and the installation went perfectly, but I guess you've got to work on your fork of bolt.new. I fixed the issue by refreshing the container in Docker, or a page restart generally works, but you've got to work on it, man.

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      We are working on the prompting for sure, but really it comes down to a lot of local LLMs still struggle with larger prompts necessary for Bolt!

  • @xXWillyxWonkaXx
    @xXWillyxWonkaXx 2 місяці тому

    I'm curious though, is this new coder good at Python/AI/ML programming or mostly Java/TypeScript/etc.?

    • @ColeMedin
      @ColeMedin  2 місяці тому

      It's really good at Python/ML programming!

  • @live--now
    @live--now 2 місяці тому

    Can you code a React Native / Node.js mobile app with that Bolt??

    • @ColeMedin
      @ColeMedin  2 місяці тому

      Bolt is just meant for web development right now! But this would be so cool to have in the future!

  • @danaaron5883
    @danaaron5883 Місяць тому

    Hey guys! Must I set up my GPU to ensure that LLM's are utilising its computational power correctly? I am using an RTX 3070 :)

    • @ColeMedin
      @ColeMedin  Місяць тому

      Ollama will use your GPU right out the gate!

  • @sciencee34
    @sciencee34 2 місяці тому

    What hardware are you using

    • @ColeMedin
      @ColeMedin  2 місяці тому

      I have two 3090 GPUs and 128 GB of RAM!

  • @DesiChichA-jq8hx
    @DesiChichA-jq8hx 2 місяці тому +5

    Yeah, but it failed for me. I used Qwen Coder for my Flutter project, and it messed up all the Dart code. 3.5 Sonnet is just on another level in comparison.

    • @AaronBlox-h2t
      @AaronBlox-h2t 2 місяці тому +1

      That sucks... I had good experience with qwen2.5 72B and 3.5 sonnet. I'll try out 32B today though. To be honest, 3.5 Sonnet was good also except for my Swift projects.

    • @ColeMedin
      @ColeMedin  2 місяці тому +3

      @@DesiChichA-jq8hx Super interesting, I appreciate you sharing! I wonder if you found a big issue with Qwen 2.5 Coder where it isn't fine tuned on much Dart code? Certainly could be, I'm not surprised that Sonnet is still clearly better with some languages.

    • @igorshingelevich7627
      @igorshingelevich7627 2 місяці тому

      @@ColeMedin The AICodeKing channel directly said the model is bad.
      Your channel: the model is good.
      Which channel goes to Veilguard then?

    • @MiraPloy
      @MiraPloy Місяць тому

      @@igorshingelevich7627 AICodeKing is a moron; that video should be disqualifying for any authority he has in your eyes.

  • @anandkanade9500
    @anandkanade9500 2 місяці тому +1

    It would be wonderful if users had a service like a fusion of oTToDev and Novita AI: choose a GPU, choose an LLM, then create projects...

    • @ColeMedin
      @ColeMedin  2 місяці тому +1

      Yes!!! That is certainly one of the end goals!

  • @AshWickramasinghe
    @AshWickramasinghe 2 місяці тому

    32B still. That's large. I'll be excited if anything below 12B can be as good. That's the point I'd say, that the federated Agentic model wins.

    • @ColeMedin
      @ColeMedin  2 місяці тому

      Yeah that's totally fair!

  • @venkatesanr8522
    @venkatesanr8522 2 місяці тому

    I have a 4090 card with 16GB VRAM and 128GB RAM. Will the 32b work on my machine?

  • @visionaryfinance
    @visionaryfinance Місяць тому

    Would it work on a MacBook Pro M3 Pro with 18GB RAM and a 512GB SSD?

    • @ColeMedin
      @ColeMedin  Місяць тому

      Maybe not quite... but I would try it out and at least use the 14b parameter version for sure!

  • @phobes
    @phobes Місяць тому

    I tried Ollama, LM Studio is significantly faster for some reason. I could be doing something wrong though lol

    • @ColeMedin
      @ColeMedin  Місяць тому +1

      Interesting! Maybe Ollama wasn't using your GPU for some reason and LM Studio was?

    • @phobes
      @phobes Місяць тому

      @@ColeMedin I thought that as well, but I confirmed it was using the GPU. I just bought a 3060 with 12GB of RAM, will test with that running Linux and see if it makes a difference.

    • @ColeMedin
      @ColeMedin  Місяць тому

      Huh that is weird... let me know how the testing goes!