ColPali: Vision-Based RAG System For Complex Documents

  • Published Jan 14, 2025

COMMENTS • 44

  • @manishsharma2211 • 3 months ago +1

    One thing I like about this man is that he gives some background on each line / framework / library used, to make people aware of all the nuanced interactions between the projects and researchers involved. Love that.

  • @israelazarkovitch5852 • 4 months ago +7

    ColPali is an excellent technique for English documents.
    When you try it on non-English documents, retrieval doesn't work well, because ColPali uses the PaliGemma model, a relatively small model trained mostly on an English dataset.

    • @engineerprompt • 4 months ago +4

      Good point, but I think you can fine-tune the vision model for other languages. Qwen is probably a good option there as well. I'll see if there are any resources available and share them.

    • @vardhan254 • 4 months ago +1

      Qwen2-VL is good for Indic languages, at least from what I have tested.

  • @saranepalashok • 4 months ago +1

    Excellent. Exactly what I was looking for. A "fine-tuning" episode on such a VBRAG pipeline would be a great follow-up.

  • @frag_it • 4 months ago +5

    Could you make an end-to-end project where, instead of a local index, we push the embeddings to a vector store like ChromaDB or Pinecone? That would be amazing.
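The wrinkle in this request is that ColPali produces *many* vectors per page (one per image patch), so the vector store has to score with late interaction (MaxSim) rather than a single cosine similarity. A toy pure-Python sketch of that scoring over an in-memory store; the vectors, page IDs, and function names are made up for illustration, and a real pipeline would swap the dict for a store with multi-vector support:

```python
# Toy in-memory multi-vector store with MaxSim (late-interaction) ranking.
# Real ColPali embeddings are ~1030 patch vectors of dimension 128 per page;
# the 2-D vectors here are illustrative only.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, page_vecs):
    # For each query-token vector, take its best-matching patch vector,
    # then sum those maxima (the ColBERT/ColPali late-interaction score).
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

# One entry per PDF page: a list of patch vectors.
store = {
    "page_1": [[1.0, 0.0], [0.0, 1.0]],
    "page_2": [[0.5, 0.5], [0.1, 0.9]],
}

def search(query_vecs, k=1):
    scored = sorted(store, key=lambda pid: maxsim(query_vecs, store[pid]),
                    reverse=True)
    return scored[:k]

print(search([[1.0, 0.0]]))  # page_1 scores higher for this toy query
```

Stores that expose native multi-vector comparators (Qdrant is mentioned later in this thread) can run this scoring server-side instead of in Python.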

  • @kai_s1985 • 4 months ago +2

    Great work! Thanks! I wonder how it compares to vanilla RAG for text PDFs in terms of accuracy. Vanilla RAG suffers when the answer to the user's question needs to be synthesized from different parts of the text. GraphRAG is good for those cases, but it is slow and expensive. Can this handle complex questions like those?

  • @tirushv9681 • 1 month ago

    I'm a big fan of this approach. I do have a question, though: we are feeding an image and the query to the LLM at the end anyway, so why not pass the PDF itself to an LLM like Claude?

    • @theepicosityofpizza • 1 month ago +1

      It's actually an interesting question how Claude parses a PDF you upload: is it also treating it as an image, or just turning it into text?
      I think you're probably right that you'd get better results from parsing the PDF, though.

  • @tirushv9681 • 1 month ago

    In fact, Qdrant also supports multi-vector embeddings.

  • @darpnpro • 3 months ago

    Thank you for sharing this!

  • @loicbaconnier9150 • 4 months ago +2

    Why do we need a GPU with large VRAM? Is it for ColPali or for the VLM?

  • @LTBLTBLTBLTB • 4 months ago +5

    I tried this technique with gemini-1.5-flash-exp-0827 instead, and it works fine.

  • @stardustlo8323 • 1 month ago

    Hi, great video! When I ran the `RAG.index()` function from byaldi on my T4 instance, it took about 20 seconds per PDF page. Is this expected? Also, does byaldi support GPU for embedding, and is it automatically utilized?
    Thanks!

  • @IdPreferNot1 • 4 months ago

    Cool find in Claudette

  • @manishsharma2211 • 3 months ago

    EXCELLENT VIDEO - THANK YOU

  • @BACA01 • 4 months ago

    Very good content, thank you.

  • @shobhitbishop • 4 months ago

    Will this work properly on PDFs containing detailed tabular information? And on hand-drawn images?

  • @athsarafernando • 19 days ago

    What is the advantage of using these VLM methods instead of just converting the PDF to markdown?

  • @diego.castronuovo • 3 months ago

    What if the information needed to answer a question spans two consecutive pages, and only the first is retrieved because the second contains only the continuation of the first?
    This is a real problem.

    • @engineerprompt • 3 months ago

      You can retrieve multiple images/pages, or append the neighboring pages to your context. To make all this simple, I have put together an OSS project, video coming soon: github.com/PromtEngineer/localGPT-Vision

  • @wdonno • 4 months ago

    How do you "chunk" or parse sections out of longer documents? Or what if we want to create a knowledge graph? The final analysis is done by an LLM, so we still have context-length issues, especially for local implementations. Can you extract the text itself for further processing?

    • @MrAhsan99 • 1 month ago

      unstructured, llamaparse, upstage parse

  • @saeeds851 • 2 months ago

    Would this approach work well for summarization with Qwen2-VL 7B locally, for technical papers with diagrams? Thank you.

    • @engineerprompt • 2 months ago

      Yes, check out the localgpt-vision project, which implements an end-to-end vision-based RAG. ua-cam.com/video/YPs4eGDpIY4/v-deo.html

  • @tecnom7133 • 4 months ago

    Maybe if you pass an image URL instead of the image bytes, you will consume fewer input tokens and so reduce cost?

  • @goran-ai • 4 months ago

    What is the best way to contact you for consulting with our dev company?

  • @Yes-lm9dq • 4 months ago

    Do you think one could use this to convert a PDF into a text file, which could then be used to generate a knowledge graph with Microsoft's GraphRAG?

  • @absar66 • 4 months ago

    Many thanks for this great video. I have a set of scanned pages saved as a PDF; will this work? Thanks.

    • @engineerprompt • 4 months ago

      Yes, I think this approach will work on scanned pages as well.

  • @tecnom7133 • 4 months ago

    Thanks

  • @RedCloudServices • 4 months ago

    I wonder if a VBRAG could perform math on values extracted from a table image? 🤔 I suppose if the extractions are accurate, they could then be passed to another agent capable of performing the calculations?

    • @engineerprompt • 4 months ago

      Math might be a little hard, but I think it's worth trying.
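One way to make the math in this thread reliable is exactly the hand-off the commenter suggests: have the VLM only *extract* values from the table image, then do the arithmetic in deterministic code (or a calculator tool) rather than asking the model to add numbers. A toy sketch; the `extracted_rows` schema stands in for whatever the VLM actually returns and is made up for illustration:

```python
# Hypothetical rows extracted by a VLM from a table image; in a real
# pipeline these would come from the model's structured output.
extracted_rows = [
    {"item": "Q1 revenue", "value": 120.5},
    {"item": "Q2 revenue", "value": 98.25},
]

def total(rows):
    # Deterministic aggregation instead of letting the LLM do arithmetic.
    return sum(r["value"] for r in rows)

print(total(extracted_rows))  # 218.75
```

The accuracy risk then shifts entirely to the extraction step, which is easier to spot-check than free-form model arithmetic.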

  • @user-uk9ls • 4 months ago

    Will this not run locally on an Nvidia RTX 4x GPU with 16 GB of RAM?

    • @engineerprompt • 4 months ago

      I think that will be able to run the pipeline.

  • @amortalbeing • 4 months ago

    thanks

  • @neatpaul • 4 months ago +1

    But this works only for PDFs; what about docx, pptx, and epub files? I want multimodal to work on those files too.

    • @fra4897 • 4 months ago +6

      It works with anything that can be converted to an image, so basically everything.

  • @kareemyoussef2304 • 3 months ago

    None of these solutions are open source, even in your other videos. I think your video that uses Marker is the only one.

  • @ayanshproplayer5559 • 4 months ago

    Does this work offline?

    • @engineerprompt • 4 months ago

      if you watch the video, you will know the answer :)