Upgrade to multi-AI: Update Vector DB to AI

  • Published Jan 15, 2025

COMMENTS • 51

  • @1NN3RxC0R3
    @1NN3RxC0R3 1 year ago +11

    The timing of this video is perfect. I was literally looking at alternatives to OpenAI's costly ada embeddings, especially when I'm working with thousands of pages for a specific domain. I think it would also be interesting to see alternatives to popular vector databases like Pinecone. Maybe a comparison of new ones that are being developed, especially open-source DBs.

  • @Davipar
    @Davipar 1 year ago +2

    You are the best AI YouTube channel out there, mate. Thank you so much for your work and sharing!!

    • @code4AI
      @code4AI  1 year ago

      Your feedback is important! Thank you.

  • @hablalabiblia
    @hablalabiblia 1 year ago +1

    You are the most important asset open-source deep learning has. Hope to meet you one of these days.

  • @jayhu6075
    @jayhu6075 1 year ago +2

    I am very glad to have found your channel. With your knowledge, people can learn and save money locally with this innovation,
    because otherwise it is too expensive for us to build our own business. I appreciate that you share your wisdom.
    Hopefully more of this stuff. Many thanks.

  • @SebastianSmith22
    @SebastianSmith22 1 year ago +1

    Amazing video! Thanks for all that knowledge. It came at just the right time! :) Best wishes from your neighbour, Hungary!

    • @code4AI
      @code4AI  1 year ago +1

      Thank you! Greetings to Hungary!

  • @1Esteband
    @1Esteband 1 year ago +1

    Each of your videos is a masterclass!
    Thank you.

  • @danson3038
    @danson3038 1 year ago

    This video + a future open-source replacement for GPT-4 and we are there ;-). Thanks again, Mr!

  • @blendercomp
    @blendercomp 1 year ago +1

    Awesome content - as always! :)
    I'm really looking forward to running an open-source LLM locally so that the need for OpenAI is completely eliminated!

  • @toddnedd2138
    @toddnedd2138 1 year ago +2

    Thanks for the video. Unfortunately, I have not yet fully understood the complete process in detail. Maybe you can give some more in-depth information about it.
    1. Do you use only the first column [CLS] of the embedding matrix for the similarity calculation?
    2. How do you score the results of the similarity calculation and rank them?
    3. If you find the documents with the highest similarity (let's say the top 5 documents), what content do you pass to GPT-4: the paragraphs with the highest similarity, some summary of each document, or a LangChain approach (document map)?

    • @code4AI
      @code4AI  1 year ago +3

      Short intro: ua-cam.com/video/ySTox2rdguM/v-deo.html
      Similarity (2 years old, but still relevant): ua-cam.com/video/MAnROXO_bnU/v-deo.html
      I just know my videos by heart, but there are a lot of other information sources on the internet about cosine similarity (text and matrix visualizations). It is not difficult at all.
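
      A minimal sketch of the similarity step discussed here (not the exact setup from the video): SBERT-style sentence embeddings are compared with cosine similarity and the documents are ranked by score. The model name and corpus below are placeholders.

      ```python
      # Hypothetical example: rank documents against a query by cosine similarity.
      from sentence_transformers import SentenceTransformer, util

      model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT model; placeholder choice

      documents = [
          "Paper on protein folding with transformer models.",
          "Tutorial on cosine similarity for text search.",
          "Notes on fine-tuning BERT for a specific domain.",
      ]  # placeholder corpus

      doc_emb = model.encode(documents, convert_to_tensor=True)      # pooled sentence vectors
      query_emb = model.encode("How do I compare text embeddings?", convert_to_tensor=True)

      scores = util.cos_sim(query_emb, doc_emb)[0]                   # cosine similarity per document
      top_k = scores.argsort(descending=True)[:5]                    # rank and keep the top 5
      for idx in top_k.tolist():
          print(f"{scores[idx].item():.3f}  {documents[idx]}")
      ```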

    • @toddnedd2138
      @toddnedd2138 1 year ago +1

      @@code4AI Thank you for the quick answer. You are right, it is not difficult at all; the hard part is to find all the different pieces that matter. Therefore, thank you again for pointing me in the right direction.

  • @littlegravitas9898
    @littlegravitas9898 1 year ago +3

    Hi, I have a project I'm working on with some friends (as a proof of concept in my own research in intelligent systems) and your videos are going to be invaluable! Is there any way we can reach out to you? I'd really love to ask a couple of questions, but I understand if you're too busy; the content you provide is so much already!

  • @MaherRagaa
    @MaherRagaa 1 year ago +1

    Hello @code_your_own_AI, thank you very much for this valuable and informative content 👍. We tried embeddings for the Arabic language using various models, but we could not achieve acceptable accuracy in semantic search. Any recommendations? Thanks 😊

    • @code4AI
      @code4AI  1 year ago

      I have no practical experience with your language, since almost all LLM models are pre-trained in English first. I have read that new LLMs will be created and pre-trained in Korean, for the Japonic languages in Japan, and multiple other nations want to create (and pre-train) LLMs in their respective languages, since semantic structures can differ significantly from English. Almost forgot: Germany introduced its first German LLM. So maybe you should explore whether there are LLMs in your language that have been pre-trained (and not only fine-tuned) in your language ... and then the tokenizers structure your language elements in an optimised form for all vector embeddings ... just some first ideas from my side.

  • @pizzaiq
    @pizzaiq 1 year ago

    Thank you for sharing.

    • @code4AI
      @code4AI  1 year ago

      Hope it is informative.

  • @dtkincaid
    @dtkincaid 1 year ago +1

    Very helpful video. I have to go check out your SBERT videos now too. If you are not using a database of some sort to store the embeddings of your documents, are you recalculating all the embeddings whenever your app starts up and keeping them in memory? Would you use something like FAISS, which is free, to create a vector index in memory? Would that offer better performance than hand-coding the search algorithm?

    • @code4AI
      @code4AI  1 year ago +4

      You have a variety of options for storing the data. Since my systems work with structured streaming and are scalable (GPU, TPU): Databricks invented the Delta Lake architecture (almost 3 years ago) for all kinds of data formats and for a lot of distributed systems, and it is Apache-compatible. I find their solutions helpful, but you can go with normal data lakes and scale them up, and other individual solutions are plentiful.
      Just search for "Delta Lake" on my channel.
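
      For the in-memory FAISS option raised in the question above, a minimal sketch (my own assumption, not the setup shown in the video): a flat inner-product index over normalized SBERT embeddings, which is equivalent to cosine similarity.

      ```python
      # Hypothetical example: build an in-memory FAISS index over SBERT embeddings (needs faiss-cpu).
      import faiss
      from sentence_transformers import SentenceTransformer

      model = SentenceTransformer("all-MiniLM-L6-v2")                       # placeholder model
      documents = ["first passage ...", "second passage ...", "third passage ..."]  # placeholder corpus

      emb = model.encode(documents, normalize_embeddings=True)              # unit vectors -> inner product == cosine
      index = faiss.IndexFlatIP(emb.shape[1])                               # exact inner-product index
      index.add(emb)

      query = model.encode(["my search query"], normalize_embeddings=True)
      scores, ids = index.search(query, 3)                                  # top-3 matches
      print(list(zip(ids[0].tolist(), scores[0].tolist())))
      ```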

  • @davidchang1586
    @davidchang1586 1 year ago

    Amazingggggg❤ very nice video!

  • @VIVEKKUMAR-kx1up
    @VIVEKKUMAR-kx1up 1 year ago +1

    love your video!!

    • @code4AI
      @code4AI  1 year ago +1

      Feedback like yours is one of the reasons why I create these videos. Thank you!

    • @VIVEKKUMAR-kx1up
      @VIVEKKUMAR-kx1up 1 year ago +1

      @@code4AI highly appreciated

  • @sirrr.9961
    @sirrr.9961 1 year ago

    I am not a programmer, but since I started understanding LLMs I have had the idea that general-purpose embeddings won't do anything for domain-specific work. But I didn't know how to do this myself. Glad that you have made this video.
    I have one more question: how can we train an open-source LLM like GPT4All for domain-specific storytelling like GPT-4 or even ChatGPT 3.5 Turbo? Please make a detailed tutorial on getting custom prompt outputs from any LLM model. And also give us a practical example of how we can prepare our own data for input into our own embeddings.
    One last thing: how can we download a huge pile of data automatically?
    I know these are too many requests, but I am really taking an interest in building something for my own country now which no one has built before. 😁

  • @WillMcCartneyAI
    @WillMcCartneyAI 1 year ago +1

    Is the core ramification of this specific workflow that a person need not use an extra vector database - essentially a customised SBERT AI would be enough?

  • @Cloudvenus666
    @Cloudvenus666 1 year ago +1

    Another amazing video! Relying on OpenAI entirely is a very costly endeavor, especially if one uses a third-party vector database too. Do you think we can use PEFT adapters for sparse retrieval and reranking? Or do we need to fine-tune all layers without freezing any parameters for this to work effectively? If we can utilize LoRA, that would be awesome, but from my understanding there is a slight trade-off in accuracy.

    • @code4AI
      @code4AI  1 year ago +5

      Parameter-efficient fine-tuning is recommended for really huge LMs, where full fine-tuning would be too costly. But since I operate my AI on a BERT (SBERT) system, those models are really tiny compared to GPT-4. I can fine-tune all my SBERT AIs on a single GPU, without any parallelism or PEFT. But if you want speed, choose JAX + FLAX + BERT (fasten your seat belt). Smile.
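
      A minimal single-GPU fine-tuning sketch along the lines of this reply, using the sentence-transformers training API; the model name, the two example pairs, and the loss are placeholder assumptions, not the exact recipe from the video.

      ```python
      # Hypothetical example: full fine-tuning of a small SBERT model on one GPU (no PEFT).
      from torch.utils.data import DataLoader
      from sentence_transformers import SentenceTransformer, InputExample, losses

      model = SentenceTransformer("all-MiniLM-L6-v2")                 # small SBERT model, placeholder

      train_examples = [                                              # toy domain pairs with similarity labels
          InputExample(texts=["gene expression profile", "transcriptomic signature"], label=0.9),
          InputExample(texts=["gene expression profile", "quarterly sales report"], label=0.1),
      ]
      train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
      train_loss = losses.CosineSimilarityLoss(model)                 # regress pair labels onto cosine scores

      model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
      model.save("sbert-domain-tuned")
      ```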

    • @Cloudvenus666
      @Cloudvenus666 1 year ago

      @@code4AI That makes sense, thank you 😊. Have you done any videos previously utilizing JAX and Flax? I know someone who used JAX on Whisper, and it was 70x faster at transcribing.

    • @code4AI
      @code4AI  1 year ago +8

      I have 5 new videos on JAX and FLAX coming up in about a week or so ....

    • @VIVEKKUMAR-kx1up
      @VIVEKKUMAR-kx1up 1 year ago +2

      @@code4AI will be waiting patiently

    • @thezhron
      @thezhron 1 year ago

      Waiting too 😂

  • @Danpage04
    @Danpage04 1 year ago

    Does the embedding method create a self-attention matrix on the input sentence?

  • @constantinebimplis
    @constantinebimplis 1 year ago +1

    Great work, thank you! I don't like paying for embeddings either. You mention Hugging Face embeddings, which are great. My question to you is: how do HF embeddings compare to the llama.cpp embeddings that have been released recently with 4096 dimensions? Would you use these? My use case is semantic search in large volumes of text. Thank you.

    • @code4AI
      @code4AI  1 year ago +1

      My first step is the vocabulary of your system: the tokenizer, its abilities and its general coverage of a representative body of text. The dimension of the mathematical vector space is not so important. If you want to generate vector embeddings for a vocabulary of 12k tokens in a 4k-dimensional vector space, the number of permutations of tokens is hardly ever high enough to justify such an extremely high-dimensional vector space; you just generate sparse matrix representations. But this is just a general remark; it depends on your training dataset, your tokenizer, your model, the layer structure, pre-training intensity, ... you get the idea.
      Maybe this helps: ua-cam.com/video/MlDP2BVWjS0/v-deo.html
      or ua-cam.com/video/04oZ2P0uvp0/v-deo.html
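
      A quick way to check the two things mentioned first in the reply above, tokenizer vocabulary size and embedding dimension, for any Hugging Face checkpoint (the model name below is only an example):

      ```python
      # Hypothetical example: inspect tokenizer vocabulary and embedding dimensions.
      from transformers import AutoTokenizer, AutoModel

      name = "sentence-transformers/all-MiniLM-L6-v2"   # example checkpoint
      tokenizer = AutoTokenizer.from_pretrained(name)
      model = AutoModel.from_pretrained(name)

      print("vocabulary size:       ", tokenizer.vocab_size)
      print("embedding dimension:   ", model.config.hidden_size)
      print("token embedding matrix:", tuple(model.get_input_embeddings().weight.shape))
      ```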

    • @constantinebimplis
      @constantinebimplis 1 year ago

      @@code4AI thank you, I devour your videos non-stop. I appreciate your efforts, content, knowledge and your time.

  • @WillMcCartneyAI
    @WillMcCartneyAI 1 year ago +1

    Have you got any thoughts on preserving privacy? Will this require using an LLM other than GPT-4 for completions?

    • @IvarDaigon
      @IvarDaigon 1 year ago +1

      Data sent to OpenAI via the paid API service is not stored or used for training.
      It is only the free ChatGPT web interface that collects data for "training".
      However, that being said, all data sent to OpenAI and all responses from the models are checked to make sure they comply with the terms of service, so nothing actually prevents OpenAI from alerting the authorities if users are using the API for criminal activity.

  • @AdamBrusselback
    @AdamBrusselback 1 year ago +1

    So, I think I understand, but I'm not quite sure and wanted to clarify. I can understand using an encoder model rather than a vector DB for things like scientific papers in your industry, but wouldn't you still want a vector database for canonical files you are working with, for example within your company?
    Wouldn't your SBERT integrate nicely with a vector database to provide context/domain-sensitive tokens for similarity search over your vector DB dataset, while also providing a secondary set of similarity tokens for "general industry knowledge" on the topic?

    • @code4AI
      @code4AI  1 year ago +1

      I don't want to surprise you, but database solutions are long gone. If you want to learn about current solutions (ua-cam.com/video/G8wQAlVGYVM/v-deo.html) or about the transition from parquet files to Delta Lake for structured streaming and other parallel features (ua-cam.com/video/toWQ1HVYSo0/v-deo.html), there are professional solutions out there. I have known Databricks for almost 5 years now, and I would recommend their DATA+AI solutions for corporate use.
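
      For readers who want to try Delta Lake locally without a Databricks workspace, a minimal sketch with the standalone deltalake Python package (my own assumption; the videos linked above use the Databricks/Spark stack):

      ```python
      # Hypothetical example: write and read a local Delta table with the deltalake package.
      import pandas as pd
      from deltalake import DeltaTable, write_deltalake

      df = pd.DataFrame({"doc_id": [1, 2], "text": ["first passage", "second passage"]})  # placeholder data
      write_deltalake("./my_delta_table", df)        # creates a versioned Delta table on local disk

      table = DeltaTable("./my_delta_table")
      print(table.version())                         # current table version
      print(table.to_pandas())                       # read the data back as a DataFrame
      ```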

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago

    Is the output dimension the length of the embedding?

    • @code4AI
      @code4AI  1 year ago

      I need more information. The output dimension of what? In the system we have multi-AI components ...

  • @midnightmoves7976
    @midnightmoves7976 1 year ago

    I worked it out :D

  • @thezhron
    @thezhron 1 year ago

    There's a new preprint talking about a "Semantic Tokenizer for Enhanced Natural Language Processing". I think it could be a golden tokenizer, maybe? 😅

    • @thezhron
      @thezhron 1 year ago

      "Our experimental results show top performance on two Glue tasks using BERT-base, improving on models more than 50X in size."

  • @riser9644
    @riser9644 1 year ago

    Why BERT and not a RoBERTa or T5 architecture?

  • @Stopinvadingmyhardware
    @Stopinvadingmyhardware 1 year ago

    Yggdrasil-Framework

  • @scottmiller2591
    @scottmiller2591 1 year ago

    I'm unclear on how you are maintaining the alignment of the new embeddings of your SBERT with the embeddings ChatGPT wants - are you freezing ChatGPT and training end to end, so SBERT is generating embeddings that are meaningful to ChatGPT, that is, something interpretable to ChatGPT as information? The issue I'm having is that there is no reason one embedding run results in anything like the same embedding as another run, even with the same data and algorithm, as long as each run starts with a random initialization of the weights. It would be essentially impossible with different algorithms. Or are you not cold-starting SBERT but just fine-tuning an SBERT embedding ChatGPT already knows about?
    Also, I'm surprised that you can access the embedding layer of ChatGPT to give it information, which is what makes this all work. OpenAI has been very tight-lipped about ChatGPT details, only giving vague generalizations, and it seems like it would be pretty easy to extract the information, if not the actual weights and architecture, if you have access to both the latent layer and the output. But I'm just speculating from my experience doing the same thing with similar structures; I don't have experience with doing this with ChatGPT. It would be too expensive to do this extraction for an individual, but there are plenty of organizations that would be capable of doing it.
    Maybe I'm missing your point here completely - are you feeding SBERT embeddings into the latent input of ChatGPT at all (which was the impression I got), or are you just getting the top-k documents using the cosine similarity of the output of the SBERT, then asking ChatGPT to summarize those top-k documents for you through the conventional text input? That would get around a number of the questions above, and has been something I'm very interested in (I've done some work in this direction, and if this is what you're doing, I agree SBERT is a good choice), as it scales to something that works locally.
    At any rate, an interesting video, with a lot of food for thought!
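
    For the second reading described above (top-k retrieval with SBERT cosine similarity, then passing the retrieved text to GPT-4 through the ordinary chat prompt), a minimal sketch; the model names, corpus, and prompt are assumptions for illustration, not confirmed details of the video's pipeline.

    ```python
    # Hypothetical example: retrieve top-k passages with SBERT, then summarize them via the GPT-4 chat API.
    from sentence_transformers import SentenceTransformer, util
    from openai import OpenAI

    sbert = SentenceTransformer("all-MiniLM-L6-v2")                       # placeholder SBERT model
    documents = ["first passage ...", "second passage ...", "third passage ..."]  # placeholder corpus

    doc_emb = sbert.encode(documents, convert_to_tensor=True)
    query = "What do these documents say about the topic?"
    query_emb = sbert.encode(query, convert_to_tensor=True)

    hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]           # top-k by cosine similarity
    context = "\n\n".join(documents[h["corpus_id"]] for h in hits)

    client = OpenAI()                                                     # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Summarize these passages:\n\n{context}\n\nQuestion: {query}"}],
    )
    print(response.choices[0].message.content)
    ```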

  • @klammer75
    @klammer75 1 year ago

    If I'm hearing you correctly, you would fine-tune several models for whatever purposes I want, like a sentiment-analysis model, a Q&A model, a summarization model, and a quiz-and-answer model, all fine-tuned, and then use GPT-4 to refine the outputs of those fine-tuned models for an optimal output, while avoiding the cost of using GPT-4 for all these inference steps, thus reducing costs... Does that sound like the right path according to this video? Very much interested, as I'm thinking of commercial applications here, and cost at scale becomes a real issue 🫣 Thanks for the insights and great video 🥳🦾

    • @code4AI
      @code4AI  1 year ago

      I don't know why you would use a separate model for each singular task; I think the benefits of multi-task pre-training (and even fine-tuning) are significant. If the transformer model has enough "learning capacity" (layers, ...) and is not limited to the bare minimum, you should not encounter problems (forgetting, other artifacts). But if you prefer it that way, why not?