Very informative video with excellent production. Looking forward to more content.
Thanks for the informative video. Other than LLMs, could you suggest some approaches or models to try for relationship extraction?
There are traditional NER methods. We will share more in new videos!
Thanks for your video. I'm also looking forward to a new video on relationship extraction using traditional methods!
What if we have no information about the entities? Suppose it's an application that takes arbitrary documents as input; in that case we have no idea in advance what the entities will be. How would it work then?
Then you can let a general NER model parse the text, depending on your use case. Or, if you do know the domain, e.g. PII data or a finance dataset, you can run it through a pretrained NER model for that particular domain.
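A minimal sketch of what that could look like with a general pretrained NER model via the Hugging Face transformers pipeline; the checkpoint name is just one commonly used example and would be swapped for a domain-specific one (PII, finance, ...) if the domain is known:

```python
from transformers import pipeline

# General-purpose pretrained NER model; replace with a domain-specific
# checkpoint if the domain (PII, finance, ...) is known in advance.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Fiat acquired Chrysler and moved its headquarters to London."
for ent in ner(text):
    print(ent["entity_group"], "->", ent["word"], round(float(ent["score"]), 2))
```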
@karthickdurai2157 The type of application I'm developing is intended to work on all types of documents, irrespective of the domain.
Building a good app with a knowledge graph without understanding input data would be challenging. A practical first step would be to generate a summary or extract key topic(s) from the document to understand its content before constructing the graph.
I have a huge company policy document that I want to create a knowledge graph for. How do I define labels for it? Or is it better to do without them? If so, can you please guide me on how to go about it without defining labels?
By labels I assume you're talking about entity names. Those are things that you should already know or have some common sense about. So you can start there or manually create a few and use LLMs or some other model to extract/generate additional ones based on them.
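As a rough illustration of the "seed a few labels, let a model propose more" idea, here is a hedged sketch that prompts an LLM with a handful of manually chosen labels plus a document excerpt; the OpenAI client, model name, and file name are illustrative assumptions, and any chat-capable model would do:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

seed_labels = ["Policy", "Department", "Role", "Deadline"]   # manually chosen examples
excerpt = open("company_policy.txt").read()[:4000]           # hypothetical document excerpt

prompt = (
    f"Here are entity labels we already use: {', '.join(seed_labels)}.\n"
    "Read the following company policy excerpt and propose up to 10 additional "
    "entity labels (one per line) that would be useful for a knowledge graph:\n\n"
    f"{excerpt}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable chat model
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```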
Can you please share the link to the notebook you went through in the video?
Sure. Here's the code: github.com/mallahyari/twosetai/blob/main/02_kg_construction.ipynb
Would you recommend using the SLIM local models you introduced earlier in this series for NER, intent classification, etc. to construct knowledge graphs? It looks like it could be a cost-saver, and it offers structured and consistent inputs for graph construction, although I'm not sure whether any of the existing SLIMs is well trained enough for this purpose.
@myfolder4561 That's potentially a good idea. We haven't tried it ourselves. Let us know if you try this approach!
@myfolder4561 It's indeed possible that you will need to train your own SLIM model for this.
How do we use documents that have images, like product manual PDF files? How can we use GraphRAG for this problem?
If you need to use images as well, you're going to need some libraries to identify and extract them. There are a few, like github.com/ai8hyf/TF-ID or PyMuPDF4LLM.
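For the PyMuPDF route, a minimal sketch of pulling embedded images out of a PDF (the file name is hypothetical); TF-ID or PyMuPDF4LLM would add layout-aware detection on top of something like this:

```python
import fitz  # PyMuPDF

doc = fitz.open("product_manual.pdf")  # hypothetical input file
for page_index, page in enumerate(doc):
    for img in page.get_images(full=True):
        xref = img[0]                   # cross-reference number of the embedded image
        info = doc.extract_image(xref)  # raw image bytes plus file extension
        with open(f"page{page_index}_img{xref}.{info['ext']}", "wb") as f:
            f.write(info["image"])
```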
It would help if the code was also available. Could you post the link to the code shown in your Jupyter notebook?
@Ash2Tutorial github.com/mallahyari/twosetai/blob/main/02_kg_construction.ipynb. We will share more code in our course. Stay tuned!
Thanks! Will you publish the code on GitHub?
Here's the code: github.com/mallahyari/twosetai
@MehdiAllahyari Thanks!!!
Yes.
Thanks for using the right tools for the purpose. I am looking at a tabular dataset that I want to use as material for an LLM to generate synthetic sample graphs from, so instead of extracting a graph from the Wikipedia page, it has to write the page given the base knowledge graph. And I believe an LLM is very useful for that.
Yes, for your use case an LLM is actually the best tool, as you want to convert structured data into natural-language form.
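A rough sketch of that direction, i.e. serializing one row (or a small subgraph) and asking an LLM to write prose from it; the OpenAI client and model name are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# One row of the tabular data / base knowledge graph, expressed as facts.
row = {"subject": "Fiat", "relation": "acquired", "object": "Chrysler", "year": 2014}
facts = "; ".join(f"{k}: {v}" for k, v in row.items())

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable chat model
    messages=[{
        "role": "user",
        "content": f"Write a short, Wikipedia-style paragraph stating only these facts:\n{facts}",
    }],
)
print(resp.choices[0].message.content)
```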
Mehdi, if you say that LLMs aren't that great at making KGs and you prefer other libraries that are more practical for making KGs, could you say which libraries you mean?
Yes, we will share more.
There are many, depending on the domain, but here are a few that tend to work very well across domains (a quick GLiNER sketch follows the list):
- github.com/urchade/GLiNER
- github.com/universal-ner/universal-ner
- github.com/kamalkraj/BERT-NER
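A minimal sketch of the zero-shot route with GLiNER from the list above; the checkpoint name is one of the publicly released ones, and the labels are free-form strings you choose yourself:

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_base")

text = "Fiat acquired Chrysler in 2014 and is headquartered in London."
labels = ["organization", "location", "date"]  # arbitrary labels, no retraining needed

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["label"], "->", ent["text"])
```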
@MehdiAllahyari Thanks, it looks awesome. I will test it for sure…
You should include in your new course how to build a knowledge graph from unstructured data using LLMs with advanced techniques.
@thanartchamnanyantarakij9950 We can include some of this information! We will host a free pre-session for folks to get more info about the course. You're welcome to join us!
Very interesting review. Any chance you could share the code so I can try it myself?
Thanks in advance.
BTW I'm reading your RAG book.
Awesome! Here's the code: github.com/mallahyari/twosetai/blob/main/02_kg_construction.ipynb
@apulacheyt Thank you! Some materials might be outdated due to changes in the libraries. Check out our course for the latest updates!
What was wrong with the previous video? As always, thank you!
Because the subtitles were distracting, we had to re-upload a new one. Unfortunately, the comments from the last video cannot be displayed on this one!
I removed the subtitles. Hopefully this is easier to watch! Thanks!
The challenge for me is that LLMs are not consistent within or between documents. In the example, you see "us" and "u.s.". I'm also concerned that Fiat is an Organization but Chrysler is a Company. And in the LLM example of triples, many of the objects are just, well, sentence fragments. The killer feature of KGs is that you can make connections...but the overspecificity would seem to prevent this. For example, I cannot connect Tom Hanks to any other "fourth highest grossing actor"...he's the only one! There seems to be no good way to create a prompt where the LLM generates entities and relationships at a consistent and appropriate level of hyper/hypo-nymy. This is perhaps not surprising given that LLMs don't think, reason, whatever. And therein lies the trap in getting LLMs to lift themselves up by their own bootstraps.
That's exactly my point in the video too. Many people are hyped/overexcited about using LLMs for extracting named entities and relations, especially when you don't define your schema first. However, there is no guarantee that you'll get consistent results. Plus, the cost is prohibitive!
I think adding an extra metadata layer (e.g. parent documents) could solve this issue. For example, you can have an LLM with semantic understanding go over each chunk and add metadata related to that chunk, so that the LLM can understand the context of each mention, e.g. "u.s." -> "United States, country".
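A hedged sketch of that metadata/normalization layer: per chunk, ask an LLM to map raw mentions to canonical names and types before they enter the graph. The client, model name, and helper function are all illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set


def canonicalize_mentions(chunk: str, mentions: list[str], model: str = "gpt-4o-mini") -> str:
    """Hypothetical helper: map surface forms found in a chunk (e.g. 'u.s.')
    to canonical entity names and types (e.g. 'United States, country')."""
    prompt = (
        f"Text chunk:\n{chunk}\n\n"
        f"Mentions: {mentions}\n"
        "For each mention, return JSON mapping it to a canonical entity name and type."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


print(canonicalize_mentions("The u.s. bailed out Chrysler during the crisis.", ["u.s.", "Chrysler"]))
```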
spacy-llm can help you do few-shot NER, and the performance is almost 99% of the traditional approach.
I think spacy-llm also uses an LLM behind the scenes, so it may not be as fast.