I've been waiting for a break down like this to help me wrap my head around ColPali. Thanks!!
Haha same here
I'm currently building with Qdrant (love the binary quantization and multi-vector approach to scale retrieval with ColPali). I was wondering if a JS/TS example exists, because that's primarily our tech stack. If not, I'll try to put something out eventually.
Thanks! We don’t have a JS/TS example yet, but we'd love to see what you create if you decide to put one together!
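In the meantime, here's a minimal TypeScript sketch of the late-interaction (MaxSim) scoring that ColPali-style retrieval is built on: for each query-token embedding, take the max dot product over a page's patch embeddings, then sum. This is plain-array math only (no Qdrant client, all names illustrative); in a real setup the embeddings would come from a ColPali model and the search would run server-side.

```typescript
// Late-interaction (MaxSim) scoring over multi-vector embeddings.
type Vec = number[];

function dot(a: Vec, b: Vec): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

// queryTokens: [numQueryTokens][dim], docPatches: [numPatches][dim].
// For each query token, take its best-matching patch, then sum.
function maxSim(queryTokens: Vec[], docPatches: Vec[]): number {
  return queryTokens.reduce(
    (sum, q) => sum + Math.max(...docPatches.map((p) => dot(q, p))),
    0
  );
}

// Rank pages (each a set of patch embeddings) by MaxSim score, best first.
function rankPages(queryTokens: Vec[], pages: Vec[][]): number[] {
  return pages
    .map((patches, idx) => ({ idx, score: maxSim(queryTokens, patches) }))
    .sort((a, b) => b.score - a.score)
    .map((r) => r.idx);
}
```

In production you wouldn't score every page client-side like this; the point of storing multi-vectors in a vector DB is that the MaxSim comparison happens during the search itself.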
I've used Tesseract OCR to get text from images and the result is just OK, certainly nowhere near what you've shown ColPali can do. Will certainly give it a try.
What approach did you take? Did you extract the images and store their text summaries? Then, during retrieval, did you use these summaries along with the original images to get the answer?
Towards the end, she passed the whole image to a large 90B Llama or GPT-4o. What's the point if we have to pass the whole image instead of patches? It would be better if we could get the patches retrieved using ColPali and run some small vision model to extract the answer.
You can, using ColPali's attention mask
@@EvgeniyaSukhodolskaya what does that mean? How do you do it?
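One common way to interpret this (a hedged sketch of the idea, not the exact method from the video): since ColPali scores each page patch against each query token, you can turn those per-patch similarities into a relevance map and crop only the top-scoring patches for a small vision model. Plain arrays here; real patch embeddings come from the model's output and map to a grid over the page image.

```typescript
// Per-patch relevance: each patch's score is its max similarity to any
// query token. Top-scoring patch indices can be mapped back to image
// regions and cropped. Illustrative only.
type Vec = number[];

const dot = (a: Vec, b: Vec): number =>
  a.reduce((s, x, i) => s + x * b[i], 0);

function patchRelevance(queryTokens: Vec[], docPatches: Vec[]): number[] {
  return docPatches.map((p) =>
    Math.max(...queryTokens.map((q) => dot(q, p)))
  );
}

// Indices of the k most relevant patches, best first.
function topPatches(scores: number[], k: number): number[] {
  return scores
    .map((score, idx) => ({ idx, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((r) => r.idx);
}
```

This is essentially the same math behind the "similarity map" visualizations people generate for ColPali: it shows which parts of the page drove the retrieval score.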
This is amazing. Punches down OCRs.
Why is this better than asking GPT-4o to read the image?
1) It's a free model
2) It's optimized for retrieval
You can't ask GPT-4o to read 100k pages each time your user wants to find some answer among them :)
@@EvgeniyaSukhodolskaya You can just ask GPT to extract the information you want and save it to the vector DB. You don't have to analyse the image every time.
@@haralc6196 well, then you need to think about every possible question you could answer with this one PDF page, and ask GPT-4o to generate all of them & answer all of them. It's not 100% certain that will cover everything, and if you're doing VRAG and need the PDF to be retrieved regardless (say, to look at the graph/chart), you'll have to save it in the DB anyway. IMO it doesn't make much sense, unless it's a specific Q&A use case / you really want to use GPT-4o for the sake of using it / volumes are too big for ColPali.