How To Create Datasets for Finetuning From Multiple Sources! Improving Finetunes With Embeddings.

  • Published 12 Jan 2025

COMMENTS • 113

  • @cesarsantos854
    @cesarsantos854 1 year ago +37

    This content is top notch among ML and AI on YouTube, showing us how it really works!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      Thank you, I’m glad it’s helpful!

  • @timothymaggenti717
    @timothymaggenti717 1 year ago +9

    Okay so after a cup of coffee and watching a couple of times, WOW. You helped me so much thank you. This has been driving me nuts and you make it look so easy to fix. I wish I was as smart as you. Thank you again. 🎉

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      You always ask the best questions, so keep them coming :)

  • @pelaus01
    @pelaus01 1 year ago +10

    Amazing work... this channel is pure gold, the exact amount of concepts, everything is spot on. Nothing beats teaching by experience like you do.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I’m glad it was helpful and thank you for the comment :)!

  • @fabsync
    @fabsync 8 months ago +1

    Finally, a freaking great tutorial! Practical, straight to the point, and it works!!

  • @leont.17
    @leont.17 1 year ago +1

    I very much appreciate that you always list the most important bullet points at the beginning.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I’m glad it’s helpful! I figured it would be nice to give a quick overview

  • @HistoryIsAbsurd
    @HistoryIsAbsurd 11 months ago +1

    Dude, seriously, your content is so clear and easy to follow. Keep it up!

  • @RAG3Network
    @RAG3Network 8 months ago

    You’re literally a genius! I appreciate you taking the time to share the knowledge with us! Exactly what I was looking for… how to create a dataset, and in such a well-put-together video. Thank you

  • @smellslikeupdog80
    @smellslikeupdog80 1 year ago +5

    I knew I subscribed here for good reason. This is consistently extremely high-quality information -- not the regurgitated stuff. It's super educational and has immensely improved my understanding.
    Please keep going bud, this is great.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! It’s greatly appreciated

  • @AemonAlgiz
    @AemonAlgiz  1 year ago +4

    Comedy dataset update! I have found an approach I think I like for it, though I didn't have time to complete it for this video. So, I will also cover that in today's live stream!

  • @boogfromopenseason
    @boogfromopenseason 8 months ago +1

    I would pay a lot of money for this information, thank you.

  • @rosenangelow6082
    @rosenangelow6082 1 year ago +2

    Great explanation with the right level of detail and depth. Good stuff. Thanks!

  • @timothymaggenti717
    @timothymaggenti717 1 year ago +3

    Wow, how do you make everything look so easy? Nice, thanks. So, East Coast? Man, you're an early bird.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I live in MST, haha. I just wake up very early :)

  • @Hypersniper05
    @Hypersniper05 1 year ago +2

    That's awesome! And you can even save the new appeal to create more data!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Indeed! It becomes a very nice self-reinforcing model, which is why I really like the fine-tuning and embedding approach.

  • @flowers134
    @flowers134 1 year ago +1

    Amazing, thanks a lot for sharing your reflections on your work and experience! It is much appreciated! This is the first time I've quickly browsed something like this and stuck with it without having to review/study and come back later. I am able to get a bird's-eye view of the topic, the options available for work, and the underlying purpose. 🥇 Pure gold. Definitely subscribed!

  • @kenfink9997
    @kenfink9997 1 year ago +2

    How would building a training set on a codebase look? Is there a good example of automating generation of a Q&A training set based on code? How do you chunk it to fit in context window - break it up by functions and classes? Where would extraneous stuff go, like requirements, imports, etc... Thanks for the great content!
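
    One plausible answer to the chunking question, sketched with Python's standard-library ast module (an assumption, not the video's method): top-level functions and classes each become a chunk, and the extraneous module-level stuff (imports, requirements, constants) is collected into a preamble chunk. Decorators and nested definitions are left as an exercise.

```python
import ast

def chunk_python_source(path: str) -> list[str]:
    """Split a Python file into one chunk per top-level function or class."""
    source = open(path, encoding="utf-8").read()
    lines = source.splitlines()
    chunks, covered = [], set()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno - 1, node.end_lineno  # 1-based -> slice bounds
            chunks.append("\n".join(lines[start:end]))
            covered.update(range(start, end))
    # Everything not covered by a def/class (imports, constants) becomes a preamble.
    preamble = "\n".join(l for i, l in enumerate(lines) if i not in covered)
    if preamble.strip():
        chunks.insert(0, preamble)
    return chunks
```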

  • @jonmichaelgalindo
    @jonmichaelgalindo 1 year ago +3

    The appeal has been processed by the approval AI... And it passed! The prescription will now be covered. 😊
    (Thank you for the video! I think datasets and installing dependencies are ML's greatest pain points at the moment.)

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! I’m glad it was helpful :)

  • @arinco3817
    @arinco3817 1 year ago +2

    This video was awesome! I'm finally starting to wrap my head around this stuff. At the same time I'm realising the power that is being unleashed onto the world!
    BTW, did you see this new paper: SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression? Looks like it's right up your alley!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! I’m glad it’s helpful :D
      I have not seen this; it is super cool though, thank you for pointing me to it! I would love to see some implementation of pruning in LLMs. Quantization is incredibly powerful, but we can only compress so much until we hit the limit. With pruning plus weight compression, we could run 30/65B-parameter models on a single consumer GPU.

  • @kaymcneely7635
    @kaymcneely7635 1 year ago +1

    Superb presentation. As always. 😊

  • @danielmz99
    @danielmz99 1 year ago +1

    Hey man, thanks for your videos, they are instructive. I am new to LLMs and I think there is a significant gap in YouTube content on the new LLMs. I know there are videos on fine-tuning GPT-3, but I can't find anything like a walkthrough for fine-tuning a larger new open-source model like Falcon-40B Instruct. If there were a playlist going through the process (Q&A fine-tune data definition, synthetic data production, fine-tuning, and testing), I am sure others like myself would be very keen followers.

  • @babyfox205
    @babyfox205 10 months ago

    Great explanations! Thanks a lot for your efforts making this great content!

  • @PhantasyAI0
    @PhantasyAI0 1 year ago +1

    Do you have a video on how to prepare a dataset for creative writing?

  • @MohamedElGhazi-ek6vp
    @MohamedElGhazi-ek6vp 1 year ago +1

    It's so helpful, thank you! What if I have multiple PDF files at the same time and each one of them has its own subject? Can I do the same for them?

  • @onurdatascience
    @onurdatascience 1 year ago +2

    Awesome video!

  • @bleo4485
    @bleo4485 1 year ago +2

    Hi Aemon, I am new to setting up a local LLM API. Could you explain a little about how to set it up? Thanks

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Hey there! From the OobaBooga web application you can enable extensions, including the API. It will run on port 5000 by default!

    • @champ8142
      @champ8142 1 year ago +1

      Hi Aemon, I checked api and public_api on the flags/extensions page; any idea why I can't connect to port 5000?
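
      For reference, a sketch of hitting that port-5000 API from Python. The endpoint and payload follow the legacy text-generation-webui API and are an assumption to check against your build (newer versions replaced it with an OpenAI-compatible endpoint). If the port refuses connections, it is worth confirming in the console output that the api extension actually loaded.

```python
import requests

# Legacy text-generation-webui generate endpoint (assumed; verify per version).
response = requests.post(
    "http://localhost:5000/api/v1/generate",
    json={"prompt": "Write a short appeal letter.", "max_new_tokens": 200},
    timeout=120,
)
print(response.json()["results"][0]["text"])
```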

  • @Tranquilized_
    @Tranquilized_ 1 year ago +3

    You are an Angel. 💜

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! I’m glad it was helpful :) I do like how you left your name like that, haha

  • @mohammedanfalvp8691
    @mohammedanfalvp8691 1 year ago +3

    I am getting an error like this.
    Token indices sequence length is longer than the specified maximum sequence length for this model (546779 > 2048). Running this sequence through the model will result in indexing errors
    Max retries exceeded. Skipping this chunk.

  • @filipbottcher4338
    @filipbottcher4338 1 year ago +1

    Well done, but how do you handle the max model length in tokenizer.encode?

  • @SamuelJohnKing
    @SamuelJohnKing 1 year ago +11

    I really love the concept, but whatever I try, I get: ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (194233 > 2048)
    Could you please update it? It would be of immense value to me :)
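
    A hedged workaround for the recurring token-length error reported above (this is not the video's script): the warning fires when an entire book is tokenized as one sequence, so window the token ids into chunks that fit the context before sending each one to the model. The overlap keeps sentences that straddle a boundary from being lost.

```python
from transformers import AutoTokenizer

# "gpt2" is a stand-in; load the tokenizer that matches your model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def window_tokens(text: str, max_tokens: int = 1500, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows that each fit the context limit.

    The length warning can still appear on this first tokenization pass; it is
    harmless here because each decoded window is short enough downstream.
    """
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_tokens - overlap
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), step)]
```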

  • @redbaron3555
    @redbaron3555 1 year ago

    Awesome content!! Thank you very much!!👏🏻👏🏻👍🏻

  • @cmosguy1
    @cmosguy1 1 year ago

    Hey @AemonAlgiz - How did you create the instruction dataset for the CYPHER query examples? Did you do that all manually?

  • @ĐôNguyễnThành-r1v
    @ĐôNguyễnThành-r1v 1 year ago +1

    Hi, I have some confusion about your content on leveraging embeddings. My understanding so far is that the embedding approach simply means "few-shot learning". The pipeline is: say I have a query; I embed the query into a vector and then search for similar vectors, which represent relevant examples, in a vector DB. Now I have my initial query + some examples of (query, answer) from the DB. Then I somehow cleverly concat my query with the retrieved examples to form a long instruction/prompt, feed it to the LLM, and just wait for the output. Did I get my understanding right?
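
    That reading matches the usual retrieve-then-prompt loop. A minimal sketch with sentence-transformers, where the model name and the (query, answer) pairs are illustrative stand-ins for a real vector database:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Stand-in (query, answer) pairs; a real system would pull these from a vector DB.
examples = [
    ("How do I appeal a denied prescription?", "Cite the denial reason, then..."),
    ("What belongs in an appeal letter?", "Patient info, policy number, then..."),
]
corpus = embedder.encode([q for q, _ in examples], convert_to_tensor=True)

def build_prompt(query: str, k: int = 2) -> str:
    """Embed the query, retrieve similar examples, prepend them as few-shot context."""
    q_vec = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_vec, corpus, top_k=k)[0]
    shots = "\n\n".join(
        f"Q: {examples[h['corpus_id']][0]}\nA: {examples[h['corpus_id']][1]}" for h in hits
    )
    return f"{shots}\n\nQ: {query}\nA:"
```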

  • @天蓝蓝的
    @天蓝蓝的 1 year ago

    Amazing work! I would like to know if it is possible to use langchain to load PDFs to batch-generate instruction datasets?
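
    Batch-loading PDFs with langchain is possible; a small sketch (the import path moved to langchain_community in newer releases, and the filenames are illustrative). Each page's text can then be fed into the instruction-generation prompt.

```python
from langchain.document_loaders import PyPDFLoader  # newer: langchain_community

pages = []
for path in ["subject_one.pdf", "subject_two.pdf"]:  # illustrative filenames
    pages.extend(PyPDFLoader(path).load())  # one Document per page

for page in pages:
    print(page.metadata["source"], len(page.page_content))
```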

  • @wilfredomartel7781
    @wilfredomartel7781 1 year ago +2

    Amazing work! I'm still trying to understand the embeddings approach. 😊

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      Basically, we would rather teach the model how to use information than try to teach it everything. So, if we can give the model enough examples of what a procedure looks like, it can learn how to better follow it.
      So, take for example a paralegal or a lawyer. They're well educated in how to write legal briefs, though they're not aware of every law in existence. They have learned how to research and leverage information, which is what we're trying to do with this approach.

    • @Hypersniper05
      @Hypersniper05 1 year ago +1

      The only way you'll understand it is by trying it yourself

    • @wilfredomartel7781
      @wilfredomartel7781 1 year ago

      @Hypersniper05 you are right.

    • @wilfredomartel7781
      @wilfredomartel7781 1 year ago +1

      @AemonAlgiz thanks for the explanation of my doubt. I will try to reproduce it in my Colab Pro.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      Let me know how the experiment goes!

  • @mygamecomputer1691
    @mygamecomputer1691 1 year ago

    Hi, I was listening to your description of raw text and then how you converted it. But can you just upload a very short story that has the style you like, take all the defaults on the training tab, use the plain TXT file, and make a LoRA that will simulate the style I like in the model I want to use?

  • @bleo4485
    @bleo4485 1 year ago +2

    Aemon, what time will your live stream be?

  • @unshadowlabs
    @unshadowlabs 1 year ago +1

    When you uploaded the additional data in superbooga, did you have to prep it first in a question-and-answer format like you did for the fine-tuning, or were you able to just upload books, files, etc. for that part? Also, thanks for doing these videos! These are by far the most informative on how this stuff works!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      I just naively dumped the entire file, which I wouldn’t do for a more sophisticated application. Though superbooga will just chunk the files for you, so you can just drag and drop massive files.

    • @unshadowlabs
      @unshadowlabs 1 year ago +1

      @AemonAlgiz Thanks! How do you deal with material with more complex formatting, such as research papers? Are the parsers good enough to handle them without a lot of data cleaning or prep work on the paper first?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      @@unshadowlabs this has been my area of expertise for years! I worked in scientific publishing for over a decade, so what I find is that trying to naively parse them works to some extent, especially with research papers since they tend to be very topically dense. What you may find challenging is keeping all of the context densely packed, so it may be worth trying to split on taxonomic/ontological concepts.

    • @unshadowlabs
      @unshadowlabs 1 year ago +1

      @AemonAlgiz Awesome, thanks for the reply! A suggestion for a video: I would love to see how you deal with different types of content and sources, what kind of data processing, wrangling, or cleaning you do, and what tools you recommend given your expertise, background, and experience.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      This is a great idea; I have dealt with some nightmarish formats.
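
      A minimal sketch of the section-based splitting suggested a few replies up, for keeping research-paper chunks topically dense. The heading regex is an assumption; real papers vary and usually need per-venue tweaks.

```python
import re

# Match numbered section headings like "1 Introduction" or "2.3 Methods".
HEADING = re.compile(r"^(?:\d+(?:\.\d+)*)\s+[A-Z][^\n]{0,80}$", re.MULTILINE)

def split_on_sections(text: str) -> list[str]:
    bounds = [m.start() for m in HEADING.finditer(text)] + [len(text)]
    head = text[:bounds[0]]  # abstract / front matter before section 1
    chunks = [head] if head.strip() else []
    chunks += [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return chunks
```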

  • @adriangabriel3219
    @adriangabriel3219 1 year ago +1

    Hi @AemonAlgiz, great video! I am using a similar approach (I use langchain for handing the documents over to an LLM) and I have tried a WizardLM model, which hasn't performed too well. What strategies (fine-tuning, in-context learning, or other models?) would you recommend to improve the performance of answering a question given the retrieved documents? Can you recommend specific models (Flan-T5 or others)?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Gorilla is specifically tuned for use with langchain, so that may be an interesting model to test with. What kind of data do you want to use? That may influence my answer here.

    • @adriangabriel3219
      @adriangabriel3219 1 year ago

      @AemonAlgiz I hadn't heard of Gorilla, so thanks for pointing that out! I would like to answer questions given paragraphs of a technical manual.

    • @adriangabriel3219
      @adriangabriel3219 1 year ago

      Hi @AemonAlgiz, I don't quite understand how to use Gorilla with an existing vector database. Could you make a video on that, or do you have guidance for it? Am I supposed to use the OpenAI API for that use case?

  • @d_b_
    @d_b_ 1 year ago

    Could you clarify the performance of the LLMs where you provide context but don't do a fine-tune? Was that last oobabooga medical appeal demo done with a fine-tuned model, or was it just using the additional embedded context?

  • @AadarshRai2
    @AadarshRai2 5 months ago

    Top-notch content.

  • @aditiasetiawan563
    @aditiasetiawan563 10 months ago

    Can you explain the code to convert PDF to JSON? I don't know how you're doing that. It's great, and that's what we need. Thanks in advance!
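
    The video's own script isn't reproduced in the comments, but a minimal sketch of the PDF-to-JSON step using pypdf (the chunk size and JSON shape are assumptions):

```python
import json
from pypdf import PdfReader

def pdf_to_json(pdf_path: str, json_path: str, chunk_chars: int = 2000) -> None:
    """Extract a PDF's text and write it as JSON chunks ready for Q&A generation."""
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump([{"chunk": c} for c in chunks], f, indent=2)
```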

  • @darklikeashadow6626
    @darklikeashadow6626 10 months ago +1

    Hi @AemonAlgiz, I am new to Python (and LLMs) and wanted to try creating a dataset from a book as well. However, when running the provided code, I got this warning (a lot):
    "Token indices sequence length is longer than the specified maximum sequence length for this model (181602 > 2048). Running this sequence through the model will result in indexing errors
    Max retries exceeded. Skipping this chunk."
    The new .JSON file was empty. I tried changing "model_max_length" from 2048 to 200000 in the tokenizer_config for my model, but that only made the warning disappear (the result was the same).
    Would love it if anyone has a solution to this :)

  • @srisai00123
    @srisai00123 10 months ago

    Token indices sequence length is longer than the specified maximum sequence length for this model (249345 > 2048). Running this sequence through the model will result in indexing errors
    I am facing this issue; please help me resolve it.

  • @CallisterPark
    @CallisterPark 1 year ago +1

    Hi @AemonAlgiz - how long did it take to finetune stablelm-base-alpha-7b? On what hardware?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +2

      Howdy! Not very long for this, since it was a fairly small finetune: about an hour. I use an AMD 7950X3D CPU and an RTX 4090.

  • @vicentegimeno6806
    @vicentegimeno6806 1 year ago +5

    Hi, I'm new to Python and I'm getting an error related to the token sequence length exceeding the model's maximum limit. Could you please help me solve the problem?
    ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (194233 > 2048). Running this sequence through the model will result in indexing errors 2023-08-24 10:41:54.890169: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

    • @SamuelJohnKing
      @SamuelJohnKing 1 year ago

      Would also love an answer to the token indices issue.

  • @protectorate2823
    @protectorate2823 1 year ago

    Hey Aemon, how can I structure my dataset so it outputs answers in a specific format every time? Is this possible?

  • @AadeshKulkarni
    @AadeshKulkarni 1 year ago

    Which model did you use in oobabooga?

  • @xspydazx
    @xspydazx 8 months ago

    Hmm... I would like to be able to update the LLM, i.e., by extracting the documents in a folder, extracting the text, and fine-tuning it in.
    I suppose the best way would be to inject it as a text dump. HOW? (Please!)
    I.e., take the whole text and tune a single epoch only, as well as saving my chat history as an input/response dump, a single epoch only.
    Question: each time we fine-tune, does it take the last layer, make a copy, train the copy, and replace the last layer? Since the model weights are FROZEN, does this mean they don't get updated? If so, is the LoRA applied to this last layer, essentially replacing it? If we keep replacing the last layer, do we essentially wipe over the previous training?
    I have seen that you can target specific layers. How do you determine which layers to target, and then create the config to match those layers?
    Question: how do we create a strategy for regular tuning without destroying the last training? Should we be targeting different layers with each fine-tune?
    Also, why can we not tune it live, i.e., while we are talking to it? Or discuss with the model and adjust it whilst talking? Is adjusting the weights done by autograd in PyTorch with the optimization, i.e., the Adam optimizer? With each turn we could produce the loss from the input by supplying the expected outputs to compare for similarity, so if the output is over a specific threshold it would fine-tune according to the loss (optimize this once), i.e., switching between train and evaluation, freezing a specific percentage of the model... essentially working with a live brain?
    How can we update the LLM with conversation? By giving it the function (function calling) to execute a single training optimization based on user feedback, i.e., positive and negative votes, and the current response chain? I.e., if the RAG was used, then the content should be tuned in?
    Sorry for the long post, but it all connects to the same thing!
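
    On the layer-targeting questions above: with LoRA nothing is copied or replaced. Every base weight stays frozen, and small trainable adapter matrices are added alongside the modules named in target_modules, which is why repeated tunes need not wipe earlier training. A sketch with the peft library; the module name is an assumption (GPT-NeoX-style models use query_key_value, LLaMA-style models use q_proj/v_proj), so inspect model.named_modules() for yours.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-7b")
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # assumed GPT-NeoX naming; check your model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)  # base weights frozen, adapters trainable
model.print_trainable_parameters()
```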

  • @othmankabbaj9960
    @othmankabbaj9960 1 year ago

    When training on a dataset, it seems the Q&A is too specific to the book. Wouldn't that make the model too specific to the use case you're training for?

  • @tatsamui
    @tatsamui 1 year ago +2

    What's the difference between this and chatting with documents?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +2

      That’s a great question! You can encourage the model to “behave” in a particular way. Though of course you’re not really imbuing the model with knowledge; you’re creating a preference for tokens that satisfy some requirement. For example, if I had enough samples for a solid fine-tune on appeals, it would write nearly human-like throughout the process.
      So, combining the influence on the model’s behavior with additional context from documents, you get a more modern version of an expert system. This is a technique we have been using in industry to get models to fulfill very specific use-cases.

    • @Hypersniper05
      @Hypersniper05 1 year ago +1

      Think of it as if you were using Bing, but the search results are very specific. This is good for closed domains and very specific tasks. I use it for work as well, on closed-domain data.

  • @LikithVibes
    @LikithVibes 1 year ago

    @AemonAlgiz How do you enable the Superbooga API?

  • @PromptoraApps
    @PromptoraApps 1 year ago

    I am getting this error: "Max retries exceeded. Skipping this chunk."

  • @GamingDaveUK
    @GamingDaveUK 1 year ago +1

    So with superbooga you could just drop in the file with the Q&A from the book, add an injection point in your prompt, and the LLM has access to the data?
    That sounds too easy lol
    So say you want oobabooga to be a storytelling AI: can you add the injection point in that opening prompt, feed it a Q&A made from Stargate scripts, and then have it use that data in responses to set tone and characters?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Superbooga makes it pretty easy! It has a drag-and-drop embedding system and handles the rest for you. It's not going to be optimal for all use-cases, but it works well in general.

  • @LoneRanger.801
    @LoneRanger.801 1 year ago

    Waiting for new content 😊

  • @li-pingho1441
    @li-pingho1441 1 year ago

    thank you soooooo much

  • @LeonvanBokhorst
    @LeonvanBokhorst 1 year ago

    🙏 thanks

  • @amortalbeing
    @amortalbeing 1 year ago

    thanks man

  • @РыгорБородулин-ц1е

    I still understood literally nothing. What do vector databases have to do with embedding vectors in language models? And how do they get utilized anyway? This video was like, "we mentioned them in adjacent sentences, and this shows they can work together".

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Howdy! I’m happy to try and explain anything that’s not clear. Where are things not making sense?

    • @РыгорБородулин-ц1е
      @РыгорБородулин-ц1е 1 year ago

      @AemonAlgiz The whole thing, the entire pipeline, especially for the QA purpose. Like, if I have a huge document put into a vector database, an embedding for a question about this document can very well be really far away from any relevant vector in the database, thus making the chances of getting a relevant vector from the database smaller. If this vector affects further model generation, then we won't get an answer to this question. It's also not clear how exactly this vector gets used within the model anyway. Is it concatenation? Or is it used as a bias vector? Or is it a soft prompt?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      @@РыгорБородулин-ц1е this is a great question! This is why we have the tags around different portions of the input, mainly to control the documents that are queried for. Since we can wrap the input, we have explicit control over what portion of the input text gets embedded for the query. Does that make more sense?
      Also, the way we chunk inputs helps to prevent getting portions of the document that aren’t relevant. The way I embedded in this example was naive, though we can use very intricate chunking methodologies to have a higher assurance of topical density.

    • @РыгорБородулин-ц1е
      @РыгорБородулин-ц1е 1 year ago

      @AemonAlgiz In that case, if we need explicit control over which documents or portions of documents are queried, these queries look more like queries to old-fashioned databases and less like questions to a language model, with a lot of manual labour and engineering knowledge required to make fruitful requests.

  • @JAIRREVOLUTION7
    @JAIRREVOLUTION7 1 year ago

    Thanks for your awesome video! If you someday want to work as a mentor for our startup, write me, dude.

  • @caseygoodrich9717
    @caseygoodrich9717 1 year ago +1

    There's a lipsync issue with your audio.

  • @pedro336
    @pedro336 1 year ago

    Did you skip the training process?

  • @stephenphillips8782
    @stephenphillips8782 1 year ago

    I am going to get fired if you don't come back

  • @fndTenorio
    @fndTenorio 1 year ago

    So in the embedding approach, the embeddings are just additional information that is injected into the prompt itself? In other words, the fine-tuned model knows how to do something, but I can use extra help (the embedding info) to generate a better prompt? If so, we are optimizing the prompt, right? Thanks for the video!

  • @leemark7739
    @leemark7739 1 year ago

    UnboundLocalError: local variable ‘iter’ referenced before assignment
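
    A hedged guess at that traceback (the failing script isn't shown in the comment): iter is first assigned inside a loop or branch that never runs, for example when every chunk gets skipped, so the later read fails; note that iter also shadows a Python builtin. The general fix pattern:

```python
def process(chunk: str) -> str:  # stand-in for the script's real work
    return chunk.upper()

chunks: list[str] = []  # imagine every chunk was skipped upstream

result = None  # assign before the loop so the name always exists
for chunk in chunks:
    result = process(chunk)

if result is None:
    raise SystemExit("No chunks were processed; nothing to write.")
```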

  • @linuxbrad
    @linuxbrad 1 year ago +10

    Wasted 10 minutes to find out you're using an API ("oogabooga"?) instead of actually telling us how.