@@AI-Makerspace Does it work for you using the pdf file from the video: d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/7df4dbdc-eb62-4d53-bc27-d334bfcb2335.pdf I tried the same file and pdf2htmlEX didn't extract any tabular data either.
I inspected the generated quarterly-nvidia.html file (using the above Notebook 1, EmbeddedTablesUnstructuredRetrieverPack) and it also doesn't contain any detected `<table>` elements, so I'm not sure what the purpose of this video is if the demo code doesn't do the main thing: table extraction?
I'm trying to run notebook 1. I just ran the first cell:
```
!pip install llama-index llama-hub unstructured==0.10.18 lxml cohere -qU
```
I got this error:
```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. tensorflow-probability 0.22.0 requires typing-extensions
```
I was pretty unhappy with the results with GPT-3.5. pdf2htmlEX loses the structural information once the HTML is flattened into text, and that makes most LLMs fail at the task. Using `python3 -m fitz gettext -mode layout -pages 3 -output /dev/stdout ~/Downloads/quarterly-nvidia.pdf` gave much better results even with smaller models, because it keeps the structure even when turned into text.
As I understand it, OpenAI along with GPT-4 Turbo next year will do RAG (text node, table node, image node) all by itself. You only need to query. People should keep exploring, but I feel that all this effort won't yield anything, as OpenAI is on track to disrupt RAG next year.
As of now OAI's RAG is rather basic and you have no means to fine-tune it. Moreover, most businesses and people don't like to upload their documents to OAI. How do you update documents? How do you evaluate the effectiveness of your OAI RAG? IMHO it's an easy way to do basic RAG, but nothing else.
The thing is, it is still worth it to rely as little as possible on OpenAI. As long as you just rely on an LLM, it is easily swappable, but if you build your stack fully on the OpenAI framework (RAG, assistants, etc.) you are completely locked in. Heck, we cannot even use half of it with Azure OpenAI.
@@antopolskiy completely agree. Even with Azure we too are struggling. And as you would have experienced, their customer service and even the response to queries is terrible; it's like they just don't care. They say they have multiple regions we can get access to, but that's outright bullshit.
@@AI-Makerspace I need to work on it, but so far I am using AnythingLLM, which is ready to use out of the box, and also AutoGen Studio. However, what I see is that such no-code solutions only take you so far; easy to use, but sometimes limiting.
@@stanTrX great feedback! Awesome to hear that you're running up against limitations of no-code tools! We recommend starting here if you want to get more into building your own applications: github.com/AI-Maker-Space/Beyond-ChatGPT
Great video, thank you! However, I am having an issue I would love your insight on. When I try to call `EmbeddedTablesUnstructuredRetrieverPack` using the converted PDF-to-HTML file, I get: "Embeddings have been explicitly disabled. Using MockEmbedding." Then when I try to run a query with a question using `run`, I always get an empty response (which I think is due to the previous point about the embedding not functioning, but in that case, how is LlamaIndex's vector store created?). I am a bit confused about how to solve this and would appreciate your help.
@@AI-Makerspace I have the same issue, where running this code:
```
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "quarterly-nvidia/quarterly-nvidia.html",
    nodes_save_path="nvidia-quarterly.pkl"
)
```
The terminal output is:
```
Embeddings have been explicitly disabled. Using MockEmbedding.
0it [00:00, ?it/s]
```
Specifically, it happens here:
```
raw_nodes = self.node_parser.get_nodes_from_documents(docs)
```
Does it create embeddings for the vector store already at this stage? I can't find where the embedding model is "explicitly disabled" and how to enable it.
@@AI-Makerspace To answer your question, yes I did introduce my OpenAI API key, but I did not specify or specifically provide an embedding model to use with the pack. I followed the code as provided in notebook 1 (not the step-by-step one, the Llama Pack one). When this portion of code is executed:
```
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "manual/manual.html",
    nodes_save_path="manual.pkl")
```
(manual is the PDF I converted into HTML) I get this:
```
Embeddings have been explicitly disabled. Using MockEmbedding.
0it [00:00, ?it/s]
```
and when I run:
```
question = 
response = embedded_tables_unstructured_pack.run(question)
```
I get: Empty response. I hope this gives more context.
Thanks for the tutorial! I'm using LlamaIndex 9.30, and these lines:
```
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "quarterly-nvidia/quarterly-nvidia.html",
    nodes_save_path="quarterly-nvidia.pkl")
```
seem to be producing this message: "Embeddings have been explicitly disabled. Using MockEmbedding." Any idea how to sort this?
Notebook 1(EmbeddedTablesUnstructuredRetrieverPack): colab.research.google.com/drive/1AyKbc6DYU3b9gWq52SU7qacH7q8DeKOf?usp=sharing
Notebook 2 (Step-by-Step): colab.research.google.com/drive/1ffaV5iFRvhzGIq8YSckf-VJ-y0AZFmA-?usp=sharing
Slides: www.canva.com/design/DAF2T6jNxuk/5HNaXFnIas1oE5bKCCLHkg/view
00:06 Tools for working with complex PDFs have limitations.
02:49 Retrieval augmented generation (RAG) aims to improve generations by using references.
08:00 Searching for similarity in vector databases using cosine similarity.
10:48 LLM applications thrive with private and domain-specific data to gain a competitive advantage.
16:10 Complex PDFs can be streamlined by analyzing qualitative and quantitative data within a single document.
18:42 Using hierarchical node references for table summarization and retrieval.
23:25 Using the Llama Pack for data pre-processing.
25:58 RAG connects structured data for hierarchical retrieval.
30:32 Improving performance through a better model and specific tabular data queries.
32:39 Using the recursive retriever from LlamaIndex to improve PDF processing.
37:17 Hierarchical chunking in PDFs requires manual engineering and building node graphs.
39:27 Using HTML representation first provides a huge improvement in preserving structured information compared to directly parsing from a PDF.
44:07 Local inference package for PDF parsing
46:19 Use GPT and synthetic data for building models
50:38 Complex systems are expensive to navigate efficiently
52:50 Unstructured data parsing and PDF extraction tools
57:04 RAG can be used for retrieval augmented question answering, but it's a subset of actual RAG.
59:19 Explore deep language models and cloud computing services
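To make the 08:00 point concrete, cosine-similarity search over a vector store boils down to ranking stored embeddings by their angle to the query embedding. Here is a toy plain-Python sketch (hand-made 3-dimensional vectors and made-up chunk names, not a real vector database or real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "vector database": each chunk of text is stored with its embedding.
db = {
    "revenue table": [0.9, 0.1, 0.0],
    "risk factors":  [0.1, 0.8, 0.3],
}
query = [0.85, 0.2, 0.05]

# Retrieval = rank stored chunks by similarity to the query embedding.
best = max(db, key=lambda k: cosine_similarity(query, db[k]))
print(best)  # -> revenue table
```

Real systems do the same ranking, just with learned embeddings over many dimensions and approximate nearest-neighbor search instead of a brute-force `max`.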
Awesome recap!
🎯 Key Takeaways for quick navigation:
00:03 🗂️ *Introduction to the complexity of working with PDFs*
- Discusses the challenges of working with complex PDFs,
- Highlights the limitations of current tools in handling complex PDFs.
01:13 📚 *Overview of the session's objectives*
- Outlines the session's goal to explore how to build pipelines for complex PDFs,
- Introduces the speakers and their roles.
02:21 🎯 *Session's learning objectives*
- Explains that the session will teach how to build retrieval augmented generation (RAG) systems,
- Mentions the partnership with Llama Index and Unstructured.io for the event.
03:18 🧱 *Core constructs of Llama Index*
- Discusses the importance of understanding the core constructs of Llama Index,
- Explains how to deal with unstructured text and structured tabular data simultaneously.
04:53 🔄 *Overview of the retrieval augmented generation process*
- Provides a detailed explanation of the retrieval augmented generation process,
- Emphasizes the importance of retrieval in improving generation.
06:14 📊 *Introduction to Llama Index*
- Introduces Llama Index as a data framework for large language model (LLM) applications,
- Highlights the importance of data in the success of LLM applications.
07:38 🧩 *Understanding the core constructs of Llama Index*
- Explains the concept of nodes in Llama Index,
- Discusses the role of the Retriever and the Query Engine in Llama Index.
09:42 📑 *The process of data retrieval*
- Describes the process of data retrieval in Llama Index,
- Discusses the importance of improving retrieval to improve generation.
11:39 📚 *Core constructs of Llama Index*
- Reiterates the importance of understanding the core constructs of Llama Index,
- Discusses the concept of nodes and their role in Llama Index.
13:54 📈 *Improving retrieval to improve generation*
- Discusses the importance of improving retrieval to improve generation,
- Highlights the role of the Retriever and the Query Engine in improving retrieval.
15:03 📊 *The importance of dealing with embedded tables*
- Discusses the importance and challenges of dealing with embedded tables in documents,
- Highlights the use case of annual reports as an example of documents with both text and tables.
18:11 🛠️ *The process of data pre-processing*
- Explains the process of data pre-processing in Llama Index,
- Discusses the use of the Embedded Table Unstructured Retrieval Pack for data pre-processing.
22:31 📝 *The process of converting PDFs to HTML*
- Demonstrates the process of converting PDFs to HTML for data pre-processing,
- Discusses the use of the pdf2htmlEX tool for the conversion process.
25:28 🔄 *The process of recursive retrieval*
- Explains the process of recursive retrieval in Llama Index,
- Discusses the use of hierarchical node references for recursive retrieval.
26:22 📚 *Using Llama Index for RAG systems*
- Discusses the use of Llama Index for building RAG systems,
- Highlights the use of GPT-4 and Ada embeddings from OpenAI,
- Explains how Llama Index is used to glue all the pieces of a RAG system together.
28:12 🛠️ *Llama Pack for data processing*
- Demonstrates the use of Llama Pack for data processing,
- Discusses how it can be used to ask questions and get responses,
- Explains how to modify the amount of retrieved documents and the LLM for better performance.
31:35 🔄 *Converting PDFs to HTML and building the index*
- Explains the process of converting PDFs to HTML and building the index,
- Discusses the use of the flat reader from Llama Index and the unstructured node parser,
- Highlights the creation of a map and the use of the Vector index store.
33:11 📈 *Importance of data processing*
- Discusses the importance of data processing in improving the performance of the system,
- Highlights the use of OCR implementations and the conversion process,
- Discusses the potential of using multi-modal language models for processing graphics or charts.
35:05 🚀 *Encouragement to start building with these tools*
- Encourages viewers to start building with these tools and making an impact,
- Discusses the potential of these tools in dealing with various data types,
- Highlights the importance of testing and building with these tools.
36:09 💡 *Addressing questions from the chat*
- Addresses various questions from the chat about the use of unstructured partition PDF, privacy, and the API,
- Discusses the use of locally hosted solutions to keep everything private,
- Highlights the potential of modifying the base.py file to run the local pipeline.
50:42 📈 *Scaling and navigating large document spaces*
- Discusses the challenges and costs associated with scaling and navigating large document spaces.
- Highlights the importance of metadata application and filtering in managing large document spaces.
- Suggests breaking down tasks into specific questions to make the process more efficient.
52:18 🗺️ *Node mapping process*
- Explains the node mapping process in the document, which involves constructing a hierarchical node structure.
- Discusses the importance of understanding the type of text and document component for better data understanding and relevance.
53:40 📄 *Using PyPDF for text extraction*
- Discusses the use of PyPDF for text extraction from PDFs.
- Highlights the limitations of PyPDF and similar tools in extracting structured data from PDFs.
54:36 📊 *Evaluating the quality of the system*
- Discusses the challenges in evaluating the quality of the system, especially in terms of information loss during the conversion process.
- Suggests using tools like RAGAS and LlamaIndex LLM-as-a-judge frameworks for evaluation.
- Emphasizes the importance of building systems that are sensitive to information loss.
56:38 🤔 *Retrieval Augmented Generation (RAG) vs Document Question Answering*
- Discusses the relationship between RAG and Document Question Answering, suggesting that the latter is a subset of the former.
- Highlights that RAG can be used for tasks beyond question answering.
- Suggests viewing RAG as a question answering machine as a useful starting point for learning.
Made with HARPA AI
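The hierarchical node references the recap describes (18:42 and 25:28) amount to: store a small summary node for each embedded table, and have retrieval first match summaries, then resolve each match back to the full table node. A toy sketch of just that idea; this is not the actual LlamaIndex `RecursiveRetriever` API, and the node IDs, summaries, and tables are invented:

```python
# Summary ("index") nodes: short descriptions that stand in for big tables.
summaries = {
    "t1": "Table of quarterly revenue by segment",
    "t2": "Table of operating expenses by category",
}
# The full table nodes the summaries point to.
full_tables = {
    "t1": [["Segment", "Revenue"], ["Data Center", "14.5B"]],
    "t2": [["Category", "Expense"], ["R&D", "2.3B"]],
}

def retrieve(query):
    # Step 1: shallow match against summaries (a real system uses embeddings).
    hits = [k for k, s in summaries.items() if query.lower() in s.lower()]
    # Step 2: recursively resolve each matched index node to its full content.
    return [full_tables[k] for k in hits]

print(retrieve("revenue"))  # matches t1's summary, returns the full table
```

The payoff is that a short natural-language summary is far easier to match against a question than raw table cells, while the answer is still generated from the full table.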
Thanks for the recap @twoplustwo5!!
@Grerg , This is perfect. Our company is working on a project to build a RAG on top of this kind of documents for French language. Thank you for this tutorial. This is very helpful :)
Love to hear this!! Good luck getting your system dialed in and creating huge value for you!
Excellent intro to the topic. Thank you guys
Thank you! Glad it helped!
As a non-native English speaker, I really like your accent.
Thanks!
Very Informative. Great Work. Thanks
Thank you for this video. Converting PDF to HTML is a new thing I learned for RAG. It would be very helpful if you did this with an open-source LLM, as not everyone can afford the OpenAI API.
Great job! comprehensive and clear explanations. How about using llama2 or mistral llm with complex pdfs?
@@smarasan @souravarua3991 you can definitely use open-source LLMs for any of these applications; however your results may vary. It's best to baseline performance with the OpenAI tools, and then compare the performance to open LLMs.
If you're looking for resources on how to leverage open-source models, start here:
ua-cam.com/video/bSTlEcAcx1o/v-deo.htmlsi=ZjNZ9zTYZBkGu6fy
Then, if you want to see how to set up Llama 2 for a RAG system look here:
ua-cam.com/video/JpQ61Vi5ijs/v-deo.htmlsi=CaewNNpiH5a0Oa07
You can also find Mistral-7B information here: ua-cam.com/video/Id4COsCrIms/v-deo.htmlsi=2WLel6p-PiYoCy_X
Basically, you'll just want to add an open-source LLM to the service context in the LlamaPack notebook (Notebook 1).
Let us know how it goes!
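For concreteness, a minimal sketch of what swapping an open-source LLM into the service context might look like (LlamaIndex ~0.9-era API; the Ollama model name and embedding model below are illustrative assumptions, so check the docs for your installed version):

```python
from llama_index import ServiceContext
from llama_index.llms import Ollama  # assumes a locally running Ollama server

# Replace the default OpenAI LLM and embeddings with local open-source ones;
# baseline against GPT-4 first, since results may vary.
service_context = ServiceContext.from_defaults(
    llm=Ollama(model="mistral"),
    embed_model="local:BAAI/bge-small-en-v1.5",
)
```

You would then pass this `service_context` wherever the notebook builds its index and query engine, depending on what the pack's constructor accepts.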
I've had trouble getting clear answers to basic questions from financial statements, like 'What’s the cost of revenue for the three months ended October 29, 2023?', following your approach. But when I switch to OpenAI's vision model and feed it a screenshot of the relevant table, the results are spot-on. This vision model can interpret and describe a whole table very accurately (just as I remember being asked to do in my Statistics 101 class 👹).
So, it seems logical to suggest that we could streamline the process by feeding entire financial documents into the vision model. This would allow the model to convert all tabular data into text, which can then be easily analyzed and queried. The major hurdle in implementing this approach at scale, however, seems to be the prohibitive cost associated with it. Is this the primary barrier we're currently facing? Thank you for your insights.
It was covered in Q&A, thanks
Great!
What if you have hundreds of PDFs with research data? Can this handle several and separate them if stored in a folder?
Great approach! How does it work with Diagrams and Graphs?
We haven't seen any tools that give amazing results for diagrams and graphs yet, but we'll be keeping our eyes peeled in 2024 for new Llama packs and LangChain libraries that aim to solve this important problem!
Awesome idea to convert PDF to HTML 🎉 I guess with human-written text in PDFs, there might be no other option than some kind of OCR… the story of my life 🙃
"we grab the package, and then install the package", oh, like it's just that easy. I'm on Mac and have been stuck on this part alone for 2 hours, spinning in circles... ugh. Wish you would provide more context.
Absolutely understand the frustration!
Can you let me know which package and I'll ensure better instructions are provided in the notebook!
There's no way to mess this up; Colab is literally a virtual environment. I guess you're trying this locally, which can be a mess with mismatched package versions.
I know the struggle, don't give up❣️
Do you have any notebooks where in you have built a RAG using pdf images, text as well as tables all in one? That would be great.
We don't yet have that example!
@@AI-Makerspace Okay, do you have any ideas on how I can build one? Everything available right now is quite scattered.
Hi! Thanks so much for your work! It's partially explained in the video, but it's still a bit hard to understand how exactly the `base.py` must be modified:
1. Indentation in the markdown cell is broken, and
2. Imports are not mentioned.
Maybe you can update it in the notebook.
Also, the second notebook seems to be using `base.py` components *before* these modifications, which is confusing.
For sure!
I've updated the notebook to resolve the indentation issue, and add the relevant import.
The second notebook is just unwrapping what the first is doing under the hood - in that case we don't need to provide specific modifications to base.py.
Really awesome video!! Any idea how to get the schema of a whole table?
Can you clarify your question a bit so I can answer it properly? Thank you!
@AI-Makerspace Thanks for your answer. At 18:10, the slides state that you can get a summary of the table but also the schema? Did I misunderstand that this is a schema of the table?
Since hierarchical indexing is needed, how about representing the data as a graph database directly through Neo4j? I think it would contain more information, since the edges are labelled.
That is certainly something you could consider if you can capture the table information reliably! Graph Indexes are definitely a strong form of index!
Is this possible without OpenAI? I don't have an OpenAI API key, so it would be great if we could use open-source models. Thanks!!
You could use Open Source models!
How do you highlight the underlying parts of the document when doing Q&A retrieval?
Can you clarify your question?
Yeah, sure. Let's say the answer to your question is a sentence on a PDF page. How could you implement this so that the sentence in the PDF is highlighted?
Is there any module/function alternative to this in LangChain?
Not at this time, there isn't! But watch this space, as I'm sure we'll see similar functionality rolled out before too long!
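In the meantime, one way to sketch this yourself is with PyMuPDF (`pip install pymupdf`), assuming you keep the retrieved source text alongside the answer: `page.search_for` returns the bounding rectangles of each match, and `add_highlight_annot` draws a highlight there. The file name and sentence below are placeholders, not from the video.

```python
import fitz  # PyMuPDF

# Placeholder values: substitute the PDF your retriever indexed and the
# exact sentence text returned as the answer's source.
sentence = "Revenue for the quarter was a record."
doc = fitz.open("quarterly-nvidia.pdf")
for page in doc:
    for rect in page.search_for(sentence):  # rectangles of each match
        page.add_highlight_annot(rect)      # draw a highlight annotation
doc.save("quarterly-nvidia-highlighted.pdf")
```

Note that exact string matching is brittle if the LLM paraphrases; matching on the retrieved node's text (rather than the generated answer) is the more reliable route.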
I have a question: does anyone know why turning the data into HTML format makes it more exploitable and gives better results when querying it with an LLM, rather than using simple embeddings from the initial PDF format?
The addition of any structure with the HTML format allows the Unstructured library to build a more complete node-graph!
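To see why the markup matters, here's a minimal stdlib sketch (the table values are illustrative, not taken from the actual filing): once a table survives as `<table>`/`<tr>`/`<td>` tags, its rows and cells can be recovered mechanically, whereas a PDF's text layer gives you only positioned glyphs with no cell boundaries.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect cell text from table rows, relying purely on the markup."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

html = """
<table>
  <tr><th>Quarter</th><th>Revenue</th></tr>
  <tr><td>Q3 FY24</td><td>$18,120M</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(html)
print(parser.rows)  # [['Quarter', 'Revenue'], ['Q3 FY24', '$18,120M']]
```

Unstructured does something far more sophisticated, but the principle is the same: structure in, structure out.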
This is a great approach
The HTML results of all the PDFs I converted using the pdf2htmlEX tool do not contain HTML table markup. Because of that, the UnstructuredElement parser doesn't produce any table-related IndexNode, so no table is indexed. Any idea?
This can be due to the provided PDF; more often, the library simply doesn't always collect the necessary tabular data, and will therefore rely more on the unstructured knowledge without an awareness of structure.
@AI-Makerspace Does it work for you using the PDF file from the video: d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/7df4dbdc-eb62-4d53-bc27-d334bfcb2335.pdf
I tried the same file, and pdf2htmlEX didn't extract any tabular data either.
I inspected the generated quarterly-nvidia.html file (using Notebook 1, `EmbeddedTablesUnstructuredRetrieverPack`), and it also doesn't contain any detected table elements, so I'm not sure what the purpose of this video is if the demo code doesn't do the main thing: table extraction.
Can we use GPT-3.5-turbo or the 16k version (0613)? I can't get node mappings after trying the same setup. Thank you in advance.
You should be able to use GPT-3.5-turbo - but GPT-4 is definitely a more consistent way to go.
I'm trying to run notebook 1. I just ran the first code cell, `!pip install llama-index llama-hub unstructured==0.10.18 lxml cohere -qU`.
I got this error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-probability 0.22.0 requires typing-extensions
That's totally fine!
Add `typing-extensions==4.5.0` to your pip install command.
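Concretely, pinning it in the same install command from the notebook would look something like this (exact pins may need adjusting to your environment):

```shell
pip install llama-index llama-hub "unstructured==0.10.18" lxml cohere "typing-extensions==4.5.0" -qU
```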
@amethyst1044 😀
@Greg you sound under the weather, hope you feel better soon. Thanks for doing this regardless 🥹🙌
Haha thanks so much @ai-whisperer! When it's show time it's go time. Feeling much better this week!!
❤🔥
I was pretty unhappy with the results with GPT-3.5. The pdf2htmlEX tool loses the structure information when the output is converted into text, and that makes most LLMs fail at the task.
Using `python3 -m fitz gettext -mode layout -pages 3 -output /dev/stdout ~/Downloads/quarterly-nvidia.pdf` gave much better results even with smaller models because it keeps the structure even when turned into text.
Great tip; the tool seems to "do what it says on the tin". Thanks for sharing!
Thanks
As I understand it, OpenAI along with GPT-4-Turbo will next year do RAG (text nodes, table nodes, image nodes) all by itself. You only need to query. People should keep exploring, though, but I feel all this effort won't yield anything, as OpenAI is on track to disrupt RAG next year.
Haven’t heard about that yet, is there a doc or announcement?
As of now, OAI's RAG is rather basic, and you have no means to fine-tune it. Moreover, most businesses and also people don't like to upload their documents to OAI. How do you update documents? How do you evaluate the effectiveness of your OAI RAG? IMHO it's an easy way to do basic RAG, but nothing else.
The thing is, it is still worth it to rely as little as possible on OpenAI. As long as you just rely on an LLM, it is easily swappable, but if you build your stack fully on the OpenAI framework (RAG, assistants, etc.), you are completely locked in. Heck, we cannot even use half of it with Azure OpenAI.
@antopolskiy Excellent point. Fully agree.
@antopolskiy Completely agree. Even with Azure we too are struggling. And as you may have experienced, their customer service and even their response to queries is terrible; it's like they just don't care. They say they have multiple regions we can get access to, but that's outright bullshit.
Too technical for me but thanks
What sort of content (entitled, say, RAG for Complex PDFs) would help you out more in your role? Thanks!
@AI-Makerspace I need to work on it, but so far I am using AnythingLLM, which is ready to use out of the box, and also AutoGen Studio. However, what I see is that such no-code solutions take you only so far: easy to use, but sometimes limiting.
@stanTrX Great feedback! Awesome to hear that you're running up against the limitations of no-code tools! We recommend starting here if you want to get more into building your own applications: github.com/AI-Maker-Space/Beyond-ChatGPT
Great video, thank you!
However, I am having an issue I would love your insight on.
When I try to call `EmbeddedTablesUnstructuredRetrieverPack` using the converted PDF --> HTML file, I get: "Embeddings have been explicitly disabled. Using MockEmbedding."
Then when I try to run a query with a question using `run`, I always get an empty response (which I think is due to the previous point about the embeddings not functioning; but in that case, how is LlamaIndex's vector store created?).
I am a bit confused about how to solve this and would appreciate your help.
Can you provide any additional details? Did you input your OpenAI API key - or provide an embedding model to use with the pack?
@@AI-Makerspace I have the same issue, where running this code:
```
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "quarterly-nvidia/quarterly-nvidia.html",
    nodes_save_path="nvidia-quarterly.pkl",
)
```
The terminal output is
```
Embeddings have been explicitly disabled. Using MockEmbedding.
0it [00:00, ?it/s]
```
Specifically, it happens here:
```
raw_nodes = self.node_parser.get_nodes_from_documents(docs)
```
Does it create embeddings for the vector store already at this stage? I can't find where the embedding model is "explicitly disabled" or how to enable it.
@AI-Makerspace To answer your question: yes, I did enter my OpenAI API key, but I did not specify or specifically provide an embedding model to use with the pack. I followed the code as provided in notebook 1 (the Llama pack one, not the step-by-step one).
When this portion of code is executed:
```
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "manual/manual.html",
    nodes_save_path="manual.pkl",
)
```
[manual is the PDF I converted into HTML]
I get this :
Embeddings have been explicitly disabled. Using MockEmbedding.
0it [00:00, ?it/s]
and when I run :
```
question =
response = embedded_tables_unstructured_pack.run(question)
```
I get : Empty response
I hope this gives you more context.
Thanks for the tutorial!
I'm using LlamaIndex 0.9.30, and these lines
embedded_tables_unstructured_pack = EmbeddedTablesUnstructuredRetrieverPack(
    "quarterly-nvidia/quarterly-nvidia.html",
    nodes_save_path="quarterly-nvidia.pkl")
seem to be producing this message: "Embeddings have been explicitly disabled. Using MockEmbedding."
Any idea how to sort this?
You'll want to specify an embeddings model when creating the UnstructuredRetrieverPack.
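As a configuration sketch only: on recent llama-index versions, the default embedding model lives on the global `Settings` object, and the "MockEmbedding" message typically appears when no real model (or API key) is available. Whether this particular pack reads the global setting is an assumption to verify against its source; the model name below is just an example.

```python
import os
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Set the key before constructing the pack, so a real embedding model
# (rather than MockEmbedding) is available. Key value is a placeholder.
os.environ["OPENAI_API_KEY"] = "sk-..."
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# On 0.9.x, the equivalent is ServiceContext.from_defaults(embed_model=...)
# passed to whatever builds the index.
```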