Great videos, thanks! It’s important to understand that IDF reduces the weight of common words that frequently appear in most documents within the corpus, as these words contribute little to document classification. Conversely, it highlights less common words, making them more important for distinguishing the documents in which they appear.
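To make that weighting concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer on a made-up three-document corpus: a word that appears in every document ("the") gets a lower IDF than a word that appears in only one ("volcano").

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "the" occurs in every document, "volcano" in only one.
corpus = [
    "the economy grew this quarter",
    "the team won the match",
    "the volcano erupted overnight",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

vocab = vectorizer.vocabulary_  # token -> column index
idf = vectorizer.idf_           # (smoothed) IDF weight per token

print("idf('the')     =", round(float(idf[vocab["the"]]), 3))      # low: appears in all 3 docs
print("idf('volcano') =", round(float(idf[vocab["volcano"]]), 3))  # higher: appears in only 1 doc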
What a treasure this is! ⚡Many thanks!
So interesting and I've even managed to use some of the ideas at work already 😀
I am so happy to hear that!
this channel is a beauty
This is really useful. Thank you.
No problem!
You are the best. You are so cool.
very clear sir
Thanks!
Thank you so much!!
No problem!!
This is great! A++
Thanks!
Hi, I have a question, if you don't mind (I'm still trying to figure out the best ways to use all these different methods): I collected data on requirements for data analytics roles in finance from a job postings website and wanted to get a sense of which requirements (skills, knowledge) are most in demand. I'm now exploring all the methods you explain on this corpus, but it seems that, for summarizing a bunch of similar job requirement descriptions, it's probably better to use something like keyword (mostly trigram) extraction. So would KeyBERT be your choice?
Sorry for the long question )
I think KeyBERT may be a great option. Out of the box, it will do a lot. It really depends on the data, though. No two corpora are exactly the same. It will require a bit of experimentation.
@python-programming Huge thanks for your comment; indeed, from the results I get I can better understand where to go next. Looking forward to hearing more from you on this channel on these topics and the ways to use them in different contexts. Thanks!
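For anyone trying something similar, here is a minimal KeyBERT sketch; the job posting snippet and the keyphrase_ngram_range=(1, 3) setting are illustrative assumptions, not taken from the video.

# Hypothetical example: pull up-to-trigram keyphrases out of a job posting snippet.
from keybert import KeyBERT

doc = ("We are looking for a data analyst with strong SQL skills, "
       "experience with Python for data analysis, and knowledge of "
       "financial reporting and dashboard tools such as Power BI.")

kw_model = KeyBERT()  # loads a default sentence-transformers embedding model
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 3),  # allow unigrams up to trigrams
    stop_words="english",
    top_n=5,
)
print(keywords)  # list of (keyphrase, similarity score) pairs

Candidate phrases are ranked by the cosine similarity between their embeddings and the document embedding, so the top results tend to be the phrases most representative of the posting.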
Hi. Thank you for your video. Have you compared the TF-IDF you calculated by hand with the one that Python gives? I use Google Colab.
When I calculated it, I got 0 for "on": TF-IDF = 1/7 * log(2/2) = 0
But Python gives 0.3
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["The cat is laying on the carpet", "The carpet is on the floor"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # fit the vocabulary and build the TF-IDF matrix
feature_names = vectorizer.get_feature_names_out()
print("tokens:", feature_names)
print("matrix:")
print(X.toarray().round(2))
Output:
tokens: ['carpet' 'cat' 'floor' 'is' 'laying' 'on' 'the']
matrix:
[[0.3 0.42 0. 0.3 0.42 0.3 0.6 ]
[0.33 0. 0.47 0.33 0. 0.33 0.67]]
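A likely explanation for the difference, sketched by hand: scikit-learn's TfidfVectorizer uses a smoothed IDF by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, so a term that appears in every document gets an IDF of 1 rather than 0, and each row is then L2-normalized. Under those default settings, the 0.3 for "on" can be reproduced like this (the variable names are mine):

import numpy as np

tokens = ["carpet", "cat", "floor", "is", "laying", "on", "the"]
counts_doc1 = np.array([1, 1, 0, 1, 1, 1, 2])  # raw counts in "The cat is laying on the carpet"
doc_freq = np.array([2, 1, 1, 2, 1, 2, 2])     # number of the 2 documents containing each token
n_docs = 2

idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF: exactly 1.0 for terms in both docs
tfidf = counts_doc1 * idf
tfidf = tfidf / np.linalg.norm(tfidf)            # L2-normalize the row, as sklearn does by default

print(tokens)
print(tfidf.round(2))  # matches sklearn's first row: "on" comes out near 0.3 instead of 0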
If I have tweets, is this the best method to use for them?
Good
Thanks!
If we only have one document that compiles all our text, will TF-IDF be useful?
Yeah, it can still tell you the most common words within that document, but for that I would use KeyBERT.
Sir, how do I access the website? I want to read some more of it. Thanks.
Wouldn't the IDF score be the same for all documents? Why do we need to multiply it by the TF score every time if we just want comparisons?
Great question. Not all docs in a corpus will have a given word. The IDF places a proportional assessment on that word, weighing its density in a single document against all relevant docs in the corpus. If you just compared TF alone, you would not get a sense of the document's larger place in the corpus.
@python-programming Got it! Thank you!
@ayanjain3106 No problem!
You can say that you are comparing after normalizing
The corpus may contain various types of documents, such as newspapers, which will enable us to understand the extent to which the term varies across different kinds of documents.
Thank you so much!!
No problem!