@@AI-Makerspace Does it work for you using the pdf file from the video: d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/7df4dbdc-eb62-4d53-bc27-d334bfcb2335.pdf I tried the same file and pdf2htmlEX didn't extract any tabular data either.
I inspected the generated quarterly-nvidia.html file (using the above Notebook 1, EmbeddedTablesUnstructuredRetrieverPack) and it also doesn't contain any detected `<table>` elements, so I'm not sure what the purpose of this video is if the demo code doesn't do the main thing: table extraction?
I'm trying to run notebook 1. I just ran the first cell:
```
!pip install llama-index llama-hub unstructured==0.10.18 lxml cohere -qU
```
I got this error:
```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. tensorflow-probability 0.22.0 requires typing-extensions
```
I was pretty unhappy with the results with GPT-3.5. pdf2htmlEX loses the structural information once the HTML is flattened into text, and that makes most LLMs fail at the task. Using `python3 -m fitz gettext -mode layout -pages 3 -output /dev/stdout ~/Downloads/quarterly-nvidia.pdf` gave much better results even with smaller models, because it keeps the structure even when turned into text.
As I understand it, OpenAI along with GPT-4 Turbo next year will do RAG (text node, table node, image node) all by itself. You only need to query. People should keep exploring, but I feel that all this effort won't yield anything, as OpenAI is on track to disrupt RAG next year.
As of now OAI's RAG is rather basic and you have no means to fine-tune it. Moreover, most businesses and people don't like to upload their documents to OAI. How do you update documents? How do you evaluate the effectiveness of your OAI RAG? IMHO it's an easy way to do basic RAG, but nothing else.
The thing is, it is still worth it to rely as little as possible on OpenAI. As long as you just rely on an LLM, it is easily swappable, but if you build your stack fully on the OpenAI framework (RAG, assistants, etc.) you are completely locked in. Heck, we cannot even use half of it with Azure OpenAI.
@@antopolskiy completely agree. Even with Azure we too are struggling. And as you would have experienced, their customer service and even the response to queries is terrible; it's like they just don't care. They say they have multiple regions we can get access to, but that's outright bullshit.
@@AI-Makerspace I need to work on it, but so far I am using AnythingLLM, which is ready to use out of the box, and also AutoGen Studio. However, what I see is that such no-code solutions only take you so far; easy to use, but sometimes limiting.
@@stanTrX great feedback! Awesome to hear that you're running up against limitations of no-code tools! We recommend starting here if you want to get more into building your own applications: github.com/AI-Maker-Space/Beyond-ChatGPT
Great video, thank you! However, I am having an issue I would love your insight on. When I try to call `EmbeddedTablesUnstructuredRetrieverPack` using the converted PDF-to-HTML file, I get: "Embeddings have been explicitly disabled. Using MockEmbedding." Then when I try to run a query with a question using `run`, I always get an empty response (which I think is due to the previous point about the embedding not functioning, but in that case, how is LlamaIndex's vector store created?). I am a bit confused about how to solve this and would appreciate your help.
@@AI-Makerspace I have the same issue, where running this code:
```
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "quarterly-nvidia/quarterly-nvidia.html",
    nodes_save_path="nvidia-quarterly.pkl"
)
```
The terminal output is:
```
Embeddings have been explicitly disabled. Using MockEmbedding.
0it [00:00, ?it/s]
```
Specifically, it happens here:
```
raw_nodes = self.node_parser.get_nodes_from_documents(docs)
```
Does it create embeddings for the vector store already at this stage? I can't find where the embedding model is "explicitly disabled" and how to enable it.
@@AI-Makerspace To answer your question, yes I did introduce my OpenAI API key, but I did not specify or specifically provide an embedding model to use with the pack. I followed the code as provided in notebook 1 (not the step-by-step one, the Llama Pack one). When this portion of code is executed:
```
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "manual/manual.html",
    nodes_save_path="manual.pkl")
```
(manual is the PDF I converted into HTML) I get this:
```
Embeddings have been explicitly disabled. Using MockEmbedding.
0it [00:00, ?it/s]
```
and when I run:
```
question = 
response = embedded_tables_unstructured_pack.run(question)
```
I get: Empty response. I hope this gives more context.
Thanks for the tutorial! I'm using LlamaIndex 9.30, and these lines:
```
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "quarterly-nvidia/quarterly-nvidia.html",
    nodes_save_path="quarterly-nvidia.pkl")
```
seem to be producing this message: "Embeddings have been explicitly disabled. Using MockEmbedding." Any idea how to sort this?
Notebook 1(EmbeddedTablesUnstructuredRetrieverPack): colab.research.google.com/drive/1AyKbc6DYU3b9gWq52SU7qacH7q8DeKOf?usp=sharing
Notebook 2 (Step-by-Step): colab.research.google.com/drive/1ffaV5iFRvhzGIq8YSckf-VJ-y0AZFmA-?usp=sharing
Slides: www.canva.com/design/DAF2T6jNxuk/5HNaXFnIas1oE5bKCCLHkg/view
00:06 Tools for working with complex PDFs have limitations.
02:49 Retrieval augmented generation (RAG) aims to improve generations by using references.
08:00 Searching for similarity in vector databases using cosine similarity.
10:48 LLM applications thrive with private and domain-specific data to gain a competitive advantage.
16:10 Complex PDFs can be streamlined by analyzing qualitative and quantitative data within a single document.
18:42 Using hierarchical node references for table summarization and retrieval.
23:25 Using the Llama Pack for data pre-processing.
25:58 RAG connects structured data for hierarchical retrieval.
30:32 Improving performance through a better model and specific tabular data queries.
32:39 Using the recursive retriever from LlamaIndex to improve PDF processing.
37:17 Hierarchical chunking in PDFs requires manual engineering and building node graphs.
39:27 Using HTML representation first provides a huge improvement in preserving structured information compared to directly parsing from a PDF.
44:07 Local inference package for PDF parsing
46:19 Use GPT and synthetic data for building models
50:38 Complex systems are expensive to navigate efficiently
52:50 Unstructured data parsing and PDF extraction tools
57:04 RAG can be used for retrieval augmented question answering, but it's a subset of actual RAG.
59:19 Explore deep language models and cloud computing services
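To make the 08:00 point concrete, cosine-similarity search over a vector store boils down to ranking stored embeddings by their angle to the query embedding. Here is a toy plain-Python sketch (hand-made 3-dimensional vectors and made-up chunk names, not a real vector database or real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "vector database": each chunk of text is stored with its embedding.
db = {
    "revenue table": [0.9, 0.1, 0.0],
    "risk factors":  [0.1, 0.8, 0.3],
}
query = [0.85, 0.2, 0.05]

# Retrieval = rank stored chunks by similarity to the query embedding.
best = max(db, key=lambda k: cosine_similarity(query, db[k]))
print(best)  # -> revenue table
```

Real systems do the same ranking, just with learned embeddings over many dimensions and approximate nearest-neighbor search instead of a brute-force `max`.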
Awesome recap!
🎯 Key Takeaways for quick navigation:
00:03 🗂️ *Introduction to the complexity of working with PDFs*
- Discusses the challenges of working with complex PDFs,
- Highlights the limitations of current tools in handling complex PDFs.
01:13 📚 *Overview of the session's objectives*
- Outlines the session's goal to explore how to build pipelines for complex PDFs,
- Introduces the speakers and their roles.
02:21 🎯 *Session's learning objectives*
- Explains that the session will teach how to build retrieval augmented generation (RAG) systems,
- Mentions the partnership with Llama Index and Unstructured.io for the event.
03:18 🧱 *Core constructs of Llama Index*
- Discusses the importance of understanding the core constructs of Llama Index,
- Explains how to deal with unstructured text and structured tabular data simultaneously.
04:53 🔄 *Overview of the retrieval augmented generation process*
- Provides a detailed explanation of the retrieval augmented generation process,
- Emphasizes the importance of retrieval in improving generation.
06:14 📊 *Introduction to Llama Index*
- Introduces Llama Index as a data framework for large language model (LLM) applications,
- Highlights the importance of data in the success of LLM applications.
07:38 🧩 *Understanding the core constructs of Llama Index*
- Explains the concept of nodes in Llama Index,
- Discusses the role of the Retriever and the Query Engine in Llama Index.
09:42 📑 *The process of data retrieval*
- Describes the process of data retrieval in Llama Index,
- Discusses the importance of improving retrieval to improve generation.
11:39 📚 *Core constructs of Llama Index*
- Reiterates the importance of understanding the core constructs of Llama Index,
- Discusses the concept of nodes and their role in Llama Index.
13:54 📈 *Improving retrieval to improve generation*
- Discusses the importance of improving retrieval to improve generation,
- Highlights the role of the Retriever and the Query Engine in improving retrieval.
15:03 📊 *The importance of dealing with embedded tables*
- Discusses the importance and challenges of dealing with embedded tables in documents,
- Highlights the use case of annual reports as an example of documents with both text and tables.
18:11 🛠️ *The process of data pre-processing*
- Explains the process of data pre-processing in Llama Index,
- Discusses the use of the Embedded Table Unstructured Retrieval Pack for data pre-processing.
22:31 📝 *The process of converting PDFs to HTML*
- Demonstrates the process of converting PDFs to HTML for data pre-processing,
- Discusses the use of the pdf2htmlEX tool for the conversion process.
25:28 🔄 *The process of recursive retrieval*
- Explains the process of recursive retrieval in Llama Index,
- Discusses the use of hierarchical node references for recursive retrieval.
26:22 📚 *Using Llama Index for RAG systems*
- Discusses the use of Llama Index for building RAG systems,
- Highlights the use of GPT-4 and Ada embeddings from OpenAI,
- Explains how Llama Index is used to glue all the pieces of a RAG system together.
28:12 🛠️ *Llama Pack for data processing*
- Demonstrates the use of Llama Pack for data processing,
- Discusses how it can be used to ask questions and get responses,
- Explains how to modify the amount of retrieved documents and the LLM for better performance.
31:35 🔄 *Converting PDFs to HTML and building the index*
- Explains the process of converting PDFs to HTML and building the index,
- Discusses the use of the flat reader from Llama Index and the unstructured node parser,
- Highlights the creation of a map and the use of the Vector index store.
33:11 📈 *Importance of data processing*
- Discusses the importance of data processing in improving the performance of the system,
- Highlights the use of OCR implementations and the conversion process,
- Discusses the potential of using multi-modal language models for processing graphics or charts.
35:05 🚀 *Encouragement to start building with these tools*
- Encourages viewers to start building with these tools and making an impact,
- Discusses the potential of these tools in dealing with various data types,
- Highlights the importance of testing and building with these tools.
36:09 💡 *Addressing questions from the chat*
- Addresses various questions from the chat about the use of unstructured partition PDF, privacy, and the API,
- Discusses the use of locally hosted solutions to keep everything private,
- Highlights the potential of modifying the base.py file to run the local pipeline.
50:42 📈 *Scaling and navigating large document spaces*
- Discusses the challenges and costs associated with scaling and navigating large document spaces.
- Highlights the importance of metadata application and filtering in managing large document spaces.
- Suggests breaking down tasks into specific questions to make the process more efficient.
52:18 🗺️ *Node mapping process*
- Explains the node mapping process in the document, which involves constructing a hierarchical node structure.
- Discusses the importance of understanding the type of text and document component for better data understanding and relevance.
53:40 📄 *Using PyPDF for text extraction*
- Discusses the use of PyPDF for text extraction from PDFs.
- Highlights the limitations of PyPDF and similar tools in extracting structured data from PDFs.
54:36 📊 *Evaluating the quality of the system*
- Discusses the challenges in evaluating the quality of the system, especially in terms of information loss during the conversion process.
- Suggests using tools like RAGAS and LlamaIndex LLM-as-a-judge frameworks for evaluation.
- Emphasizes the importance of building systems that are sensitive to information loss.
56:38 🤔 *Retrieval Augmented Generation (RAG) vs Document Question Answering*
- Discusses the relationship between RAG and Document Question Answering, suggesting that the latter is a subset of the former.
- Highlights that RAG can be used for tasks beyond question answering.
- Suggests viewing RAG as a question answering machine as a useful starting point for learning.
Made with HARPA AI
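The hierarchical node references the recap describes (18:42 and 25:28) amount to: store a small summary node for each embedded table, and have retrieval first match summaries, then resolve each match back to the full table node. A toy sketch of just that idea; this is not the actual LlamaIndex `RecursiveRetriever` API, and the node IDs, summaries, and tables are invented:

```python
# Summary ("index") nodes: short descriptions that stand in for big tables.
summaries = {
    "t1": "Table of quarterly revenue by segment",
    "t2": "Table of operating expenses by category",
}
# The full table nodes the summaries point to.
full_tables = {
    "t1": [["Segment", "Revenue"], ["Data Center", "14.5B"]],
    "t2": [["Category", "Expense"], ["R&D", "2.3B"]],
}

def retrieve(query):
    # Step 1: shallow match against summaries (a real system uses embeddings).
    hits = [k for k, s in summaries.items() if query.lower() in s.lower()]
    # Step 2: recursively resolve each matched index node to its full content.
    return [full_tables[k] for k in hits]

print(retrieve("revenue"))  # matches t1's summary, returns the full table
```

The payoff is that a short natural-language summary is far easier to match against a question than raw table cells, while the answer is still generated from the full table.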
Thanks for the recap @twoplustwo5!!
@Grerg , This is perfect. Our company is working on a project to build a RAG on top of this kind of documents for French language. Thank you for this tutorial. This is very helpful :)
Love to hear this!! Good luck getting your system dialed in and creating huge value for you!
Excellent intro to the topic. Thank you guys
Thank you! Glad it helped!
As a non-native English speaker, I really like your accent.
Thanks!
Very Informative. Great Work. Thanks
Thank you for this video. Converting PDF to HTML is a new thing I learned for RAG. It would be very helpful if you did this with an open-source LLM, as not everyone can afford the OpenAI API.
Great job! comprehensive and clear explanations. How about using llama2 or mistral llm with complex pdfs?
@@smarasan @souravarua3991 you can definitely use open-source LLMs for any of these applications; however your results may vary. It's best to baseline performance with the OpenAI tools, and then compare the performance to open LLMs.
If you're looking for resources on how to leverage open-source models, start here:
ua-cam.com/video/bSTlEcAcx1o/v-deo.htmlsi=ZjNZ9zTYZBkGu6fy
Then, if you want to see how to set up Llama 2 for a RAG system look here:
ua-cam.com/video/JpQ61Vi5ijs/v-deo.htmlsi=CaewNNpiH5a0Oa07
You can also find Mistral-7B information here: ua-cam.com/video/Id4COsCrIms/v-deo.htmlsi=2WLel6p-PiYoCy_X
Basically, you'll just want to add an open-source LLM to the service context in the LlamaPack notebook (Notebook 1).
Let us know how it goes!
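For concreteness, a minimal sketch of what swapping an open-source LLM into the service context might look like (LlamaIndex ~0.9-era API; the Ollama model name and embedding model below are illustrative assumptions, so check the docs for your installed version):

```python
from llama_index import ServiceContext
from llama_index.llms import Ollama  # assumes a locally running Ollama server

# Replace the default OpenAI LLM and embeddings with local open-source ones;
# baseline against GPT-4 first, since results may vary.
service_context = ServiceContext.from_defaults(
    llm=Ollama(model="mistral"),
    embed_model="local:BAAI/bge-small-en-v1.5",
)
```

You would then pass this `service_context` wherever the notebook builds its index and query engine, depending on what the pack's constructor accepts.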
I've had trouble getting clear answers to basic questions from financial statements, like 'What’s the cost of revenue for the three months ended October 29, 2023?', following your approach. But when I switch to OpenAI's vision model and feed it a screenshot of the relevant table, the results are spot-on. This vision model can interpret and describe a whole table very accurately (just as I remember being asked to do in my Statistics 101 class 👹).
So, it seems logical to suggest that we could streamline the process by feeding entire financial documents into the vision model. This would allow the model to convert all tabular data into text, which can then be easily analyzed and queried. The major hurdle in implementing this approach at scale, however, seems to be the prohibitive cost associated with it. Is this the primary barrier we're currently facing? Thank you for your insights.
It was covered in Q&A, thanks
Great!
What if you have hundreds of PDFs with research data? Can this handle several and separate them if stored in a folder?
Great approach! How does it work with Diagrams and Graphs?
We haven't seen any tools that give amazing results for diagrams and graphs yet, but we'll be keeping our eyes peeled in 2024 for new Llama packs and LangChain libraries that aim to solve this important problem!
Awesome idea to convert PDF to HTML 🎉 I guess with human-written text in PDFs, there might be no other option than some kind of OCR… the story of my life 🙃
"we grab the package, and then install the package", oh, like it's just that easy. I'm on Mac and have been stuck on this part alone for 2 hours, spinning in circles... ugh. Wish you would provide more context.
Absolutely understand the frustration!
Can you let me know which package and I'll ensure better instructions are provided in the notebook!
There's no way to mess this up; Colab is literally a virtual environment. I guess you're trying this locally, which can be a mess with mismatched package versions.
I know the struggle, don't give up❣️
Do you have any notebooks where in you have built a RAG using pdf images, text as well as tables all in one? That would be great.
We don't yet have that example!
@@AI-Makerspace Okay, do you have any ideas on how I can build one? Everything available right now is quite scattered.
Hi! Thanks so much for your work! It's partially explained in the video, but it's still a bit hard to understand how exactly the `base.py` must be modified:
1. Indentation in the markdown cell is broken, and
2. Imports are not mentioned.
Maybe you can update it in the notebook.
Also, the second notebook seems to be using `base.py` components *before* these modifications, which is confusing.
For sure!
I've updated the notebook to resolve the indentation issue, and add the relevant import.
The second notebook is just unwrapping what the first is doing under the hood - in that case we don't need to provide specific modifications to base.py.
Really awesome video!! Any idea how to get the schema of a whole table?
Can you clarify your question a bit so I can answer it properly? Thank you!
@AI-Makerspace Thanks for your answer. At 18:10, the slides state that you can get a summary of the table but also the schema? Did I misunderstand that this is a schema of the table?
Since hierarchical indexing is needed, how about representing the data as a graph database directly through Neo4j? I think it would contain more information, since the edges are labelled.
That is certainly something you could consider if you can capture the table information reliably! Graph Indexes are definitely a strong form of index!
Is this possible without OpenAI? I don't have an OpenAI API key, so it would be great if we could use open-source models. Thanks!!
You could use Open Source models!
How do you highlight the underlying parts of the document when doing Q&A retrieval?
Can you clarify your question?
Yeah, sure. Let's say the answer to your question is a sentence on a PDF page. How could you implement this so that the sentence in the PDF is highlighted?
Is there any module/function alternative to this in LangChain?
Not at this time, there isn't! But watch this space, as I'm sure we'll see similar functionality rolled out before too long!
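In the meantime, one way to sketch this yourself is with PyMuPDF (`pip install pymupdf`), assuming you keep the retrieved source text alongside the answer: `page.search_for` returns the bounding rectangles of each match, and `add_highlight_annot` draws a highlight there. The file name and sentence below are placeholders, not from the video.

```python
import fitz  # PyMuPDF

# Placeholder values: substitute the PDF your retriever indexed and the
# exact sentence text returned as the answer's source.
sentence = "Revenue for the quarter was a record."
doc = fitz.open("quarterly-nvidia.pdf")
for page in doc:
    for rect in page.search_for(sentence):  # rectangles of each match
        page.add_highlight_annot(rect)      # draw a highlight annotation
doc.save("quarterly-nvidia-highlighted.pdf")
```

Note that exact string matching is brittle if the LLM paraphrases; matching on the retrieved node's text (rather than the generated answer) is the more reliable route.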
I have a question: does anyone know why turning the data into HTML format makes it more exploitable and gives better results when querying it with an LLM, rather than using simple embeddings from the initial PDF format?
The addition of any structure with the HTML format allows the Unstructured library to build a more complete node-graph!
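To see why the markup matters, here's a minimal stdlib sketch (the table values are illustrative, not taken from the actual filing): once a table survives as `<table>`/`<tr>`/`<td>` tags, its rows and cells can be recovered mechanically, whereas a PDF's text layer gives you only positioned glyphs with no cell boundaries.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect cell text from table rows, relying purely on the markup."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

html = """
<table>
  <tr><th>Quarter</th><th>Revenue</th></tr>
  <tr><td>Q3 FY24</td><td>$18,120M</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(html)
print(parser.rows)  # [['Quarter', 'Revenue'], ['Q3 FY24', '$18,120M']]
```

Unstructured does something far more sophisticated, but the principle is the same: structure in, structure out.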
This is a great approach
The HTML results of all the PDFs I converted using the pdf2htmlEX tool do not contain HTML table markup. Because of that, the UnstructuredElement parser doesn't produce any table-related IndexNode, so no table is indexed. Any idea?
This can be due to the provided PDF; more often, the library simply doesn't always collect the necessary tabular data, and will therefore rely more on the unstructured knowledge without an awareness of structure.
@AI-Makerspace Does it work for you using the PDF file from the video: d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/7df4dbdc-eb62-4d53-bc27-d334bfcb2335.pdf
I tried the same file, and pdf2htmlEX didn't extract any tabular data either.
I inspected the generated quarterly-nvidia.html file (using Notebook 1, `EmbeddedTablesUnstructuredRetrieverPack`), and it also doesn't contain any detected table elements, so I'm not sure what the purpose of this video is if the demo code doesn't do the main thing: table extraction.
Can we use GPT-3.5-turbo or the 16k version (0613)? I can't get node mappings after trying the same setup. Thank you in advance.
You should be able to use GPT-3.5-turbo - but GPT-4 is definitely a more consistent way to go.
I'm trying to run notebook 1. I just ran the first code cell, `!pip install llama-index llama-hub unstructured==0.10.18 lxml cohere -qU`.
I got this error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-probability 0.22.0 requires typing-extensions
That's totally fine!
Add `typing-extensions==4.5.0` to your pip install command.
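Concretely, pinning it in the same install command from the notebook would look something like this (exact pins may need adjusting to your environment):

```shell
pip install llama-index llama-hub "unstructured==0.10.18" lxml cohere "typing-extensions==4.5.0" -qU
```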
@amethyst1044 😀
@Greg you sound under the weather, hope you feel better soon. Thanks for doing this regardless 🥹🙌
Haha thanks so much @ai-whisperer! When it's show time it's go time. Feeling much better this week!!
❤🔥
I was pretty unhappy with the results with GPT-3.5. The pdf2htmlEX tool loses the structure information when the output is converted into text, and that makes most LLMs fail at the task.
Using `python3 -m fitz gettext -mode layout -pages 3 -output /dev/stdout ~/Downloads/quarterly-nvidia.pdf` gave much better results even with smaller models because it keeps the structure even when turned into text.
Great tip; the tool seems to "do what it says on the tin". Thanks for sharing!
Thanks
As I understand it, OpenAI along with GPT-4-Turbo will next year do RAG (text nodes, table nodes, image nodes) all by itself. You only need to query. People should keep exploring, though, but I feel all this effort won't yield anything, as OpenAI is on track to disrupt RAG next year.
Haven’t heard about that yet, is there a doc or announcement?
As of now, OAI's RAG is rather basic, and you have no means to fine-tune it. Moreover, most businesses and also people don't like to upload their documents to OAI. How do you update documents? How do you evaluate the effectiveness of your OAI RAG? IMHO it's an easy way to do basic RAG, but nothing else.
The thing is, it is still worth it to rely as little as possible on OpenAI. As long as you just rely on an LLM, it is easily swappable, but if you build your stack fully on the OpenAI framework (RAG, assistants, etc.), you are completely locked in. Heck, we cannot even use half of it with Azure OpenAI.
@antopolskiy Excellent point. Fully agree.
@antopolskiy Completely agree. Even with Azure we too are struggling. And as you may have experienced, their customer service and even their response to queries is terrible; it's like they just don't care. They say they have multiple regions we can get access to, but that's outright bullshit.
Too technical for me but thanks
What sort of content (entitled, say, RAG for Complex PDFs) would help you out more in your role? Thanks!
@AI-Makerspace I need to work on it, but so far I am using AnythingLLM, which is ready to use out of the box, and also AutoGen Studio. However, what I see is that such no-code solutions take you only so far: easy to use, but sometimes limiting.
@stanTrX Great feedback! Awesome to hear that you're running up against the limitations of no-code tools! We recommend starting here if you want to get more into building your own applications: github.com/AI-Maker-Space/Beyond-ChatGPT
Great video, thank you!
However, I am having an issue I would love your insight on.
When I try to call `EmbeddedTablesUnstructuredRetrieverPack` using the converted PDF --> HTML file, I get: "Embeddings have been explicitly disabled. Using MockEmbedding."
Then when I try to run a query with a question using `run`, I always get an empty response (which I think is due to the previous point about the embeddings not functioning; but in that case, how is LlamaIndex's vector store created?).
I am a bit confused about how to solve this and would appreciate your help.
Can you provide any additional details? Did you input your OpenAI API key - or provide an embedding model to use with the pack?
@@AI-Makerspace I have the same issue, where running this code:
```
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "quarterly-nvidia/quarterly-nvidia.html",
    nodes_save_path="nvidia-quarterly.pkl",
)
```
The terminal output is
```
Embeddings have been explicitly disabled. Using MockEmbedding.
0it [00:00, ?it/s]
```
Specifically, it happens here:
```
raw_nodes = self.node_parser.get_nodes_from_documents(docs)
```
Does it create embeddings for the vector store already at this stage? I can't find where the embedding model is "explicitly disabled" or how to enable it.
@AI-Makerspace To answer your question: yes, I did enter my OpenAI API key, but I did not specify or specifically provide an embedding model to use with the pack. I followed the code as provided in notebook 1 (the Llama pack one, not the step-by-step one).
When this portion of code is executed:
```
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "manual/manual.html",
    nodes_save_path="manual.pkl",
)
```
[manual is the PDF I converted into HTML]
I get this :
Embeddings have been explicitly disabled. Using MockEmbedding.
0it [00:00, ?it/s]
and when I run :
```
question =
response = embedded_tables_unstructured_pack.run(question)
```
I get : Empty response
I hope this gives you more context.
Thanks for the tutorial!
I'm using LlamaIndex 0.9.30, and these lines
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "quarterly-nvidia/quarterly-nvidia.html",
    nodes_save_path="quarterly-nvidia.pkl")
seem to be producing this message: "Embeddings have been explicitly disabled. Using MockEmbedding."
Any idea how to sort this?
You'll want to specify an embeddings model when creating the UnstructuredRetrieverPack.
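As a configuration sketch only: on recent llama-index versions, the default embedding model lives on the global `Settings` object, and the "MockEmbedding" message typically appears when no real model (or API key) is available. Whether this particular pack reads the global setting is an assumption to verify against its source; the model name below is just an example.

```python
import os
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Set the key before constructing the pack, so a real embedding model
# (rather than MockEmbedding) is available. Key value is a placeholder.
os.environ["OPENAI_API_KEY"] = "sk-..."
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# On 0.9.x, the equivalent is ServiceContext.from_defaults(embed_model=...)
# passed to whatever builds the index.
```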