So, training the model brings up the same problem we've had before: how to extract useful training data from things like PDF files. I still maintain that it is absolutely necessary to respect the document structure, at a minimum, in order to extract useful keys (the blobs of text that get embedded and then stored in the vector database). Your later video about extracting summaries and embedding those as keys in the database applies here, I think. It shouldn't be all that difficult to create some good heuristics that can handle different types of PDF files: some are papers, some are books. A book has less obvious structure than an academic paper, which usually has numbered headings (we hope :), so these two types, at least, should be handled differently. Creating snippets of text to embed from a book is possibly a much more difficult task. I guess if we have an LLM with a big enough context window, we could present the entire book to the LLM and ask it to generate snippets of text to use as keys. For example: "Give me some summaries of the main characters in the book. What are their names, their genders, their main activities, their relationships to other characters, and so forth?"
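As a rough illustration of the "obey the document structure" heuristic for papers, here is a minimal sketch that splits already-extracted plain text into sections at numbered headings, so that each section can become one key. The function name `split_by_headings`, the heading pattern, and the sample text are all assumptions made up for this example; real PDFs would need a text-extraction step first and a more forgiving pattern.

```python
import re

# Assumed heading shape: "1 Introduction", "2.1 Related Work", "3.2.1. Setup".
# A body line that happens to start with a number and a capitalized word
# would be misread as a heading, so treat this as a starting heuristic only.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z].*)$")


def split_by_headings(text):
    """Return a list of (number, title, body) tuples, one per numbered section."""
    sections = []
    current = None
    for line in text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            # Start a new section: [number, title, list of body lines].
            current = [match.group(1), match.group(2).strip(), []]
            sections.append(current)
        elif current is not None:
            current[2].append(line)
        # Text before the first heading (title, abstract) is skipped here.
    return [(num, title, "\n".join(body).strip()) for num, title, body in sections]


sample = """Some Paper Title
1 Introduction
We study chunking.
2 Method
2.1 Heading Detection
Headings are matched with a regex.
"""

sections = split_by_headings(sample)
for number, title, body in sections:
    print(number, title, "|", body)
```

Each `(number, title, body)` tuple could then be embedded as is, or summarized first and the summary embedded as the key. A book without numbered headings would fall through this heuristic entirely, which is the case where the LLM-generated snippets might be needed instead.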
I don't really see the benefit over recent RAG developments, such as attaching a sub-document summary as metadata to each chunk for retrieval.
Even if you use sub-document summaries as metadata, it may not work, since you skip the reflection/critique step, which is very important.