Optimize Your AI Models

  • Published Dec 26, 2024

COMMENTS • 76

  • @MarcioLena
    @MarcioLena 4 months ago +20

    I need to say I’ve learned more from your videos than from any other channel. Thank you! Fantastic job. 🎉

  • @solyarisoftware
    @solyarisoftware 4 months ago +2

    Hi Matt, this was a very useful explanation. It may have just scratched the surface of parameter tuning, but you clarified the difference between the maximum, default, and explicitly set context window sizes (via the MODEL file), which is great and answers some of the questions I had asked in other video comments. However, I would still propose a dedicated session explaining in detail how to balance context window size and other parameters, considering RAM usage. It's important to be clear that context window length impacts the amount of RAM required (likely in gigabytes), and since RAM is a shared resource dependent on model parameter size, a specific session in an Ollama course explaining all these trade-offs would be valuable.
    Another related suggestion is to dedicate a session in the Ollama course to models designed for conversations (chat). The conversation capabilities of each model depend on specific fine-tuning for conversational tasks (as opposed to one-shot question-answer tasks), and conversations involve context window RAM.

  • @phiarchitect
    @phiarchitect 4 months ago +5

    the best overview on Ollama parameters

  • @NLPprompter
    @NLPprompter 4 months ago +6

    Matt, could you share your thoughts on the concept of 'context caching' as employed by Gemini and Anthropic? How might it impact Ollama?

  • @dwrout
    @dwrout 4 months ago +2

    Shoutout to Matt for another fantastic video! 👍 I've been exploring Ollama recently, pushing it with larger documents, and was blown away by what my updated setup can handle. Using Mistral-Nemo's 128k context on dual 16GB RTX 4060 Ti cards, here's what I found:
    - Single GPU: a 7168 context size gave a reliable 20 tokens/second; with a 32k context it plummeted to just 5 tokens/second.
    - Dual GPUs: with the same 32k context, I'm now cruising at around 19 tokens per second, and there's still headroom to increase it further!

    • @technovangelist
      @technovangelist  4 months ago +3

      If the model needs more memory, then the two GPUs help. But if it fits in one, the two GPUs will slow it down.

  • @КравчукІгор-т2э
    @КравчукІгор-т2э 2 months ago

    Thanks to the professional and engaging way you tell your stories, you keep gaining subscribers. I wish you an early 1,000,000!

  • @AmrAbdeen
    @AmrAbdeen 1 month ago

    hands down the best video on the subject matter

  • @jelliott3604
    @jelliott3604 4 months ago +3

    This is excellent, very informative and useful - qualities that output from other channels frequently lacks.
    (I feel inspired to RTFM!)

    • @technovangelist
      @technovangelist  4 months ago +1

      Great to hear! Thanks

    • @jelliott3604
      @jelliott3604 4 months ago

      @technovangelist It is refreshing to hear someone who actually has knowledge of the subject talking about it without having to resort to hype and hyperbole.

    • @aliveandwellinisrael2507
      @aliveandwellinisrael2507 1 month ago

      A case of "garbage in, garbage out", or is the model simply trained better? :)

  • @MarceloPlaza
    @MarceloPlaza 4 months ago +3

    Thank you for the great explanation of the parameters. Hope you can share some practical examples from your experience: the best settings for coding, chat, research, and so on.

  • @dontrez8412
    @dontrez8412 3 months ago +1

    Wow. I didn't even know some of that stuff existed. Thanks Matt!

  • @c0t1
    @c0t1 4 months ago +4

    Great video, Matt!

  • @razvanab
    @razvanab 4 months ago

    This was very helpful, sir. More like this, please. Thank you.

  •  4 months ago

    Very good explanation and very good video presentation. I congratulate you.

  • @EriCraftCreations
    @EriCraftCreations 4 months ago +1

    Thank you for making this video. It was educational and I subbed.

  • @Sid_Okay
    @Sid_Okay 2 months ago

    Best video on the topic ever, on point and understandable. Thanks!

  • @JustinJohnson13
    @JustinJohnson13 4 months ago +1

    Excellent explanation! Thank you.

  • @AliAlias
    @AliAlias 4 months ago

    Thanks very much ❤🌹
    Great tutorial 🙏🙏🙏

  • @romulopontual6254
    @romulopontual6254 4 months ago +1

    Another great video! Thank you.

  • @AnimusOG
    @AnimusOG 4 months ago +2

    THANK YOU MY MAN, I needed this.

  • @jaggyjut
    @jaggyjut 3 months ago

    This is gold. Thank you for explaining this topic.

  • @avantepec
    @avantepec 4 months ago +1

    Great video as usual

  • @MeinDeutschkurs
    @MeinDeutschkurs 4 months ago

    I use the Python lib with num_ctx 128000, and I only get 1012 tokens back, with the context cut off. I have 192GB of RAM on macOS. Could there be a bug? I use options=dict(num_ctx=128000)

    • @technovangelist
      @technovangelist  4 months ago

      Perhaps a better question for either of the Discords. Links are in the description.

    • @MeinDeutschkurs
      @MeinDeutschkurs 4 months ago

      @technovangelist I’m not on Discord. I’ll ask this on the Python Ollama GitHub site.

    • @technovangelist
      @technovangelist  4 months ago

      Pretty easy to join. But what’s the code you are using? Just for that call, not the whole program.

  • @AlfredNutile
    @AlfredNutile 4 months ago

    Great work explaining all this thanks!❤🎉

  • @mbottambotta
    @mbottambotta 4 months ago +1

    This is an outstandingly useful video, really well made. Thank you, Matt!
    I have not been able to find a way to set these parameters in the Ollama Python module. Did I miss it, or is that simply not possible?

    • @technovangelist
      @technovangelist  4 months ago

      Absolutely it is possible. I’ll come up with an example.
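
      (For reference, a minimal sketch of what such a call might look like with the ollama Python package; the model name, prompt, and option values below are placeholders, not from the video:)

      import ollama

      # runtime parameters are passed per call through the options dict
      response = ollama.chat(
          model="llama3.1",
          messages=[{"role": "user", "content": "Summarize this document ..."}],
          options={"num_ctx": 8192, "temperature": 0.7},
      )
      print(response["message"]["content"])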

  • @yonnierenton6177
    @yonnierenton6177 4 months ago +3

    Lots of info delivered quickly and clearly, with examples! Very cool, cheers. 🦙

  • @wvanginkel5572
    @wvanginkel5572 1 month ago

    Great video, Matt and thanks for doing this! Question for you. Setting the temperature close to 0 or 0 altogether is a way to get more 'deterministic' behaviour regarding token generation by changing the probability distribution of the next token. You also mentioned that you can set the seed for the random number generator making the LLM essentially deterministic as the token generation will be the same given the start token (if I understand correctly). How does temperature and seed relate to each other?

    • @technovangelist
      @technovangelist  1 month ago

      Seed is the important value for getting more deterministic output, but at that point your better option is just pulling from a database. It will always return the same value and be many times faster.
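
      (To make the relationship concrete, a small illustrative sketch using the ollama Python package: temperature 0 collapses sampling to the most likely token, while a fixed seed makes any remaining sampling randomness repeatable across runs. The model and values are placeholders:)

      import ollama

      # temperature 0 makes the sampler pick the most likely token;
      # a fixed seed makes any remaining randomness reproducible run to run
      response = ollama.generate(
          model="llama3.1",
          prompt="Name one planet.",
          options={"temperature": 0, "seed": 42},
      )
      print(response["response"])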

  • @SK-hn8wo
    @SK-hn8wo 1 month ago

    I run llama3.1 via Ollama. My GPU has 48GB of VRAM, but Ollama is only using 7562MB (I checked with the nvidia-smi command on Ubuntu).
    How can I make full use of the GPU memory and increase the output speed?

  • @ElimaneYassineSEIDOU
    @ElimaneYassineSEIDOU 1 month ago

    Hi, thanks for the video. Can we perform Monte Carlo Dropout on a model using Ollama?

  • @anujdatta1657
    @anujdatta1657 4 months ago

    Could you explain where I type in the "FROM llama3.1
    PARAMETER num_ctx 131072"?
    You look to be in some sort of terminal.
    Thanks

    • @technovangelist
      @technovangelist  4 months ago

      In the Modelfile used to create a model.

    • @anujdatta1657
      @anujdatta1657 4 months ago +1

      Thank you.
      For anyone who doesn't know how to do this on Windows:
      1. Open your IDE or even Notepad
      2. Type
      FROM llama3.1
      PARAMETER num_ctx 131072
      3. Save it as Modelfile, without an extension, in the folder where your Ollama files are (I suggest searching for Ollama on your Windows machine to be sure)
      4. Go to the folder where you saved the file (assuming you do this in File Explorer) and type CMD into the address bar of that window to open a Command Prompt in the right location.
      5. Type ollama create mybiggerllama3.1 -f ./Modelfile
      This creates the new model.
      6. ollama run mybiggerllama3.1
      This runs the newly created model.
      Optional:
      Type ollama show mybiggerllama3.1
      Under parameters you should see num_ctx 131072
      if everything was successful!

  • @Nick_With_A_Stick
    @Nick_With_A_Stick 4 months ago

    It’s such a shame llama.cpp doesn’t support paged attention (aka a paged KV cache) as used in vLLM. There was a feature request, but one of the devs essentially said "nah, I’m good." It would allow huge context lengths without slowdown by using CPU memory.
    Also, PS: your videos are such high quality.

  • @ROKKor-hs8tg
    @ROKKor-hs8tg 4 months ago

    How do I run Ollama on an Intel iGPU?

  • @fotisj321
    @fotisj321 1 month ago

    When you discussed how to create a new version of a model with a larger context, I was asking myself: why would you do that, if you can set the context-size parameter when using the API and also in the GUI? And as always: thanks for your great videos.

    • @technovangelist
      @technovangelist  1 month ago +1

      Sure. But there is no GUI that is part of Ollama, or even an official GUI. And not all GUIs expose it.
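
      (For readers wondering what "set it via the API" looks like in practice, a small illustrative sketch against Ollama's local REST endpoint; the model, prompt, and num_ctx value are placeholders:)

      import requests

      # per-request override of the context window through the REST API;
      # no custom Modelfile is needed for this single call
      resp = requests.post(
          "http://localhost:11434/api/generate",
          json={
              "model": "llama3.1",
              "prompt": "Summarize the following text: ...",
              "stream": False,
              "options": {"num_ctx": 32768},
          },
      )
      print(resp.json()["response"])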

  • @crankypanini8421
    @crankypanini8421 4 months ago

    At last, someone posting an awesome Ollama/llama.cpp parameters video.
    A question more about the ollama serve parameter settings:
    1: On Windows, how do I select which GPU UUID I want Ollama to use? I saw that the relevant environment variable is, for example, CUDA_VISIBLE_DEVICES.
    Thanks

    • @technovangelist
      @technovangelist  4 months ago

      I don’t think you can. Some work has been done recently on multiple GPUs, but it’s still not really finished yet.

  • @wdonno
    @wdonno 4 months ago

    What is the max context length supported by an 8 GB card?

    • @technovangelist
      @technovangelist  4 months ago

      I’m not sure. I think with an 8B model at q4 you might get 4k or 8k tokens. But that depends on what else the machine is doing.

    • @wdonno
      @wdonno 4 months ago

      Thanks for the reply. This is a really helpful channel!

  • @jesusjim
    @jesusjim 2 months ago

    I found it to be a fire hose of info 😮 I loved it ❤ I need to go back and go over the num_ctx part to make better use of llama3.1:70b on a 64GB RAM Mac Studio.

    • @jesusjim
      @jesusjim 2 months ago

      Oops! 128k was way too much; swap usage went over 40GB. lol

  • @flyingwasp1
    @flyingwasp1 4 months ago

    Great video, Matt. Less rambling here.

  • @konstabelpiksel182
    @konstabelpiksel182 4 months ago +1

    👍👍

  • @BirdManPhil
    @BirdManPhil 2 months ago

    I wish there were a tool that would let us put in our system specs and tell us which parameters and quants can run on that system for the model we choose.
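
    (No such tool appears in the video, but a rough back-of-envelope estimate is possible. A minimal sketch, assuming an fp16 KV cache and ignoring runtime overhead; the llama3.1-8B architecture numbers (32 layers, 8 KV heads, head dim 128) and ~4.5 bits per weight for a q4 quant are approximations:)

    def estimate_vram_gb(params_b, bits_per_weight, n_layers, n_kv_heads, head_dim, num_ctx):
        # weights: parameter count times the quantized bits per weight
        weight_bytes = params_b * 1e9 * bits_per_weight / 8
        # KV cache: keys and values for every layer and context position, fp16 (2 bytes)
        kv_bytes = 2 * n_layers * n_kv_heads * head_dim * 2 * num_ctx
        return (weight_bytes + kv_bytes) / 1024**3

    # llama3.1 8B at roughly q4 with an 8k context window
    print(round(estimate_vram_gb(8, 4.5, 32, 8, 128, 8192), 1))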

  • @flat-line
    @flat-line 4 months ago

    What is the difference between perplexity and temperature?

    • @technovangelist
      @technovangelist  4 months ago

      temperature is a parameter you can set to adjust the randomness of generated text. perplexity is an evaluation metric that measures how close the generated text is to the correct text or meaning of the text. One you can set, the other measures the output.

    • @flat-line
      @flat-line 4 months ago

      @technovangelist Thanks a lot!

    • @jjolla6391
      @jjolla6391 1 month ago

      @technovangelist How can the LLM measure the value of the output (perplexity) if it doesn’t know the truth?

    • @technovangelist
      @technovangelist  1 month ago

      The truth isn’t relevant. What is most likely correct is the key.
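
      (A tiny illustrative sketch of the distinction discussed above: temperature rescales the model's next-token probabilities before sampling, while perplexity is computed afterwards from the probabilities the model assigned to a reference text. Toy numbers only, not from the video:)

      import math

      # temperature: divide logits by T before softmax; low T sharpens, high T flattens
      def softmax_with_temperature(logits, temperature):
          scaled = [x / temperature for x in logits]
          m = max(scaled)
          exps = [math.exp(x - m) for x in scaled]
          total = sum(exps)
          return [e / total for e in exps]

      # perplexity: exp of the average negative log-probability given to the reference tokens
      def perplexity(token_probs):
          return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

      print(softmax_with_temperature([2.0, 1.0, 0.1], 0.5))  # sharper, more deterministic
      print(softmax_with_temperature([2.0, 1.0, 0.1], 2.0))  # flatter, more random
      print(perplexity([0.5, 0.25, 0.8]))                    # lower perplexity = closer fit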

  • @tecnopadre
    @tecnopadre 4 months ago

    Wouldn't you say that temperature, window size, and top K are the only ones that make a difference?

    • @technovangelist
      @technovangelist  4 months ago +1

      Most of the time there is little reason to use any of them but they all have value.

  • @mbarsot
    @mbarsot 4 months ago +2

    Hi! For the first time I found the video too complex... Maybe after the first three or four parameters you could create a dedicated video explaining the concepts in more detail and showing a few examples? I found it very hard to understand otherwise.

    • @technovangelist
      @technovangelist  4 months ago +2

      Pausing and slowing down playback are useful.

  • @hasaanirfan6073
    @hasaanirfan6073 4 months ago +2

    Yayyyyyyyyyy

  • @JNET_Reloaded
    @JNET_Reloaded 4 months ago

    It's RAM it needs, not necessarily GPU RAM, but RAM in general!

  • @selvakumars6487
    @selvakumars6487 4 months ago +2

    Too fast, Matt. But great content; if you could take the time to explain each one with an example, that would be better.

    • @NLPprompter
      @NLPprompter 4 months ago +2

      Maybe a separate video for the most-used ones...
      or a separate video for each of them. That would be a nice digital investment; one day someone might search for something like seed or context in ML (the SEO for 'seed' might be confusing, you'd need an SEO GPT to generate it).

  • @homberger-it
    @homberger-it 4 months ago +1

    Why are there so many Matts becoming AI video creators?? 😂

    • @technovangelist
      @technovangelist  4 months ago +2

      I used to work at a company called OpenText. My manager was Matt Adney, whose manager was Matt Brine, whose manager was Matt something else, and the top guy was a Matt as well.

    • @technovangelist
      @technovangelist  4 months ago +2

      And when I was at Microsoft, all the Matt W.'s in Redmond had an email group because we would all get each other's packages and faxes.

    • @jelliott3604
      @jelliott3604 4 months ago

      @technovangelist
      ua-cam.com/video/ZymUMAu_fB0/v-deo.htmlsi=EJh5jbfsrWw6PFDs

  • @thetrueanimefreak6679
    @thetrueanimefreak6679 4 months ago +1

    Thank you for the explanation, I really only knew about temperature.