Bro you are a legend and a half...Prof spent 3 weeks on this and your 12 minute video just explained this beautifully.
You mean a legend and a quarter.
Are you guys doing a Data Science degree course from college?
@@raj-nq8ke no
@@Sss-kj1ev computer engineering!
@@Sss-kj1ev a legend and a half (smart aleck)
Man, I truly appreciate you more than any of my college professors. Hope you achieve all your dreams.
I have been watching his videos for the last 2+ years (since my college days). Proud to see this community and the growth he has made. Good luck, mate.
6:36 information gain starts
Great explanation. This is the best channel for becoming a perfect data scientist.
Thank you so much for this. There was a lot of confusion in the textbook, but the two videos on entropy and information gain are a lifesaver! You are the best. Just subscribed.
The actual working mechanism behind the decision tree algorithm is clearly explained. Thanks for uploading!
Naik, your trainings are very clear and smart.
50% of professionals are from an R background. Can you include R code in all videos as well? Ordering and appropriate long titling is an excellent approach.
One tip: please include data types in every discussion, as many functions are sensitive to data types. This also gives an opportunity to mention alternatives.
Ex: K-Means clustering. In 99% of articles, no one explains how to do this on categorical data.
R code will double the number of hits.
I wish that you will eventually start your own data science consulting company.
Your teaching skill is amazing.
LOVE YOU! I WATCHED YOUR ENTROPY VIDEO AND NOW THIS. IT'S SO HELPFUL FOR ME, SINCE I WANT TO BE A DATA SCIENTIST.
Sir, please make more videos on web scraping from scratch for beginners. I am not from a CS background, but for data science I think we should have knowledge of web scraping, so please make a video on this topic. You are a great teacher and a role model for me. Thank you, sir.
Join as a member.. there is one end-to-end project related to ML
Hats off to you man!! Your teaching skills are amazing.
Better explanation because of the numerical example used. Absolutely beautiful. Superlike
These are the only tutorials I watch and understand the first time
I love the way you explain things. Very clear and easy to digest
Hats off... no match for you exists... excellent.
Best tutorial on ML I could find. Thank you very much, Krish sir. God bless you.
Standing ovation for this nice explanation. Thank you very much, you're so kind.
Thanks Krish. There is slight confusion between entropy and information gain; I am sure it will be clarified in the process.
deepest respect, fantastic explanation
Best explanation among all the YouTube videos!
Hi Krish, I want to mention a small correction here... f1 has 9Y|5N, so f2's Y and N should sum up to 9 and f3's Y and N should sum up to 5, but they are different in both your examples :) ... Other than this, your lucid explanation of the concepts is quite amazing. You rock!
I will take that back... 9Y|5N is for the labels. f1 has 2 categories, c1 and c2; c1 is 6Y|2N and c2 is 3Y|3N. In case of numerical features, we select a threshold to treat the values as categories, and since we get as many thresholds as the number of values, it is computationally expensive... Completely understood the concept now.
In place of f2 and f3, there should actually be the values of the feature f1 (say, yes or no / high or low). After the split we will get the entropy at the individual nodes and can calculate the IG for f1. Similarly, the IG for f2, f3, etc. can be found; the feature with the highest IG is chosen, and then we go ahead with the further split.
@krish please correct me if I am wrong
Yes u r right
Exactly! :) Also, after Information gain we should do Intrinsic value and then Gain ratio, which is our final result.
Thank you Krish, you explain everything in detail! No words to thank you.
Thank you for your explanations, you are incredible! You help us so much!
Got the right channel to learn machine learning :) . Thanks Bro.
krish.... you are a true gem
Thanks sir, all sessions are very informative.
Really appreciate your effort and the videos. Thank you very much Krish.
Thank you, Krish sir. Nice video.
I can see his passion for machine learning
Yes
Krish, can you please do videos on time series analysis?
Key takeaways (see the sketch below):
If entropy is closer to or equal to 1, the node is more impure.
If entropy is 0, the node is considered a leaf node, i.e. a pure split.
The decision tree split is chosen based on the feature with the highest information gain value.
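A minimal sketch of that takeaway in Python (the class counts below are made-up examples, not taken from the video):

import math

def entropy(counts):
    # counts = class counts at a node, e.g. [7, 7] for 7 yes / 7 no
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

print(entropy([7, 7]))   # 1.0   -> 50/50 split, maximally impure
print(entropy([8, 0]))   # 0.0   -> all one class, a pure leaf node
print(entropy([9, 5]))   # ~0.94 -> somewhere in between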
As per my understanding, the entire data is present in the root node, and the split happens based on feature/column values with respect to the target variable by computing entropy and information gain. If one column/feature is used at one split, that feature/column won't be used for further splitting. Correct me if I'm wrong.
Yes u r right
@@krishnaik06 Thanks Krish
You made it so clear. Thank you.
Simply amazing, loved the way you explained the concept; it was really easy to understand.
Thank you sir, a very helpful video for me. I will definitely share it with my friends as well.
Does it mean that, to calculate information gain, various tree structures will be created, and the structure with the highest information gain will be taken for decision tree training? How will it calculate this, and how many tree structures will it consider? Basically, how many combinations of trees will it create, and what is the decision criterion for the same?
IG is calculated at each feature level while constructing the tree.
Only a single tree is created.
superb video and very nicely described..thanks Krish
H(S) = H(f_1) = 0.94, while in the gain formula (black marker) it's written as 0.91.
Yes, it's a mistake, but H(S) is 0.94.
Calculate the average of all the entropies for f1, f2 and f3.
Yes, it is a mistake, but there will not be much difference in the fraction values, so no need to worry.
Information gain is the entropy of the parent minus the weighted entropy of the children.
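To make that formula concrete with the counts discussed above (a root of 9Y|5N splitting into 6Y|2N and 3Y|3N): H(parent) ≈ 0.940, the weighted child entropy is (8/14)·0.811 + (6/14)·1.0 ≈ 0.892, so the information gain is about 0.940 − 0.892 ≈ 0.048. The Python script a few comments below computes exactly this.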
Amazing! You explain everything so well! Thank you!
You're no less than Andrew Ng for me.. respect++
Very clear explanation. Great job on this video!
Great efforts , Thanks a lot
Beautifully explained....thank you sir!!
Thanks a lot!, very clear explanation
You are awesome Krish
Ultimate Video once again Sir
Python script for the calculation (will work for any tree, not just binary):

import math

def entropy(counts):
    # counts = class counts at a node, e.g. [6, 2] for 6 yes / 2 no
    total = sum(counts)
    return sum(-c * math.log2(c / total) / total if c > 0 else 0 for c in counts)

def gain(*subsets):
    # subsets = class counts of each child node after the split
    parent = list(map(sum, zip(*subsets)))        # class counts of the parent node
    weighted = [entropy(s) * sum(s) / sum(parent) for s in subsets]
    return entropy(parent), weighted, entropy(parent) - sum(weighted)

print(gain([6, 2], [3, 3]))   # -> (~0.940, [~0.464, ~0.429], gain ~0.048)
Good explanation, want more videos on machine learning. Thank you so much, Krish.
Sir u are a life saver ❤
very nice explanation sir!
You're just awesome at teaching online.
As Always you are the best
Excellent explanation, sir.
Thanks for the nice explanation.
Very well explained, really helpful. 🤗
Simply superb, thank you.
First of all, a big thanks to you, as you have made learning very easy and interesting :) If the information gain on one leaf node is calculated as 1 and on the other leaf node as 0.4 (any value less than 1), then which leaf node should be considered?
based on what he said in video from 4:22 to 5:06 , I think the leaf node with 0.4 is better to consider
@@HShravzP, he talked about entropy during that part, but the question was asked about information gain. We have to select the leaf node with the higher information gain value.
@@parveenparveen9384 okay understood👍
Leaf nodes do not have information gain. A leaf node has entropy, and the node just before the leaf node has information gain for the split. If a node is going to be split further, the split is evaluated on the basis of the information gain from its child nodes, and if a child node is also going to be split, it is evaluated on the basis of the information gain from its own children. This process repeats until we reach a node that does not need a further split (a leaf node). Because of this greed for achieving purity by splitting nodes, trees are always prone to overfitting the training data; that's why we use different parameters to control the growth of the tree, to stop overfitting and get a tree model that generalizes better to unseen data.
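A minimal sketch of the kind of growth-control parameters mentioned above, using scikit-learn (the parameter values are arbitrary examples; X and y stand for whatever feature matrix and labels you are working with):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)     # any dataset will do for the illustration

clf = DecisionTreeClassifier(
    criterion="entropy",       # split on information gain instead of Gini
    max_depth=3,               # stop growing after 3 levels
    min_samples_split=10,      # a node needs at least 10 samples to be split
    min_samples_leaf=5,        # every leaf must keep at least 5 samples
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())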
Here H(S) is about the target variable; we then take, for each feature, the difference between the entropy of the target and the average entropy of that feature to see where to split.
this video was very useful. But please solve a question with attributes.
Nice explanation, sir.
You said that 0 entropy is the worst, and that information gain is actually finding the average of the entropy of the whole structure. Then, according to your definition, the lesser the information gain, the better it should be, but at the end you said the more the IG, the better. You have contradicted your statements... please explain that correctly.
0 entropy is the best; in any scenario we want to minimize the entropy. At each level we want to maximize the information gain, because that leads us the fastest way from the high entropy we have right now to the low entropy we want to reach.
Hi Krish, it would be more precise to use probability of + than percentage of +.
Yes... if it were a percentage, why didn't he multiply by 100? It must be probability.
Great explanation
deserves million views
but not many are interested in ml.. ;D
Thanks Alot Krish :)
Thank you for the video, Krish! When RF uses the Gini index, is it just substituting H(S) in the information gain formula with GI? In other words, does the information gain concept still apply when using the Gini index?
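Not an answer from the video, but a rough sketch of what swapping Gini impurity into the same parent-minus-weighted-children formula would look like (class counts reused from the example discussed above):

def gini(counts):
    # Gini impurity: 1 - sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = gini([9, 5])                                           # ~0.459
children = (8 / 14) * gini([6, 2]) + (6 / 14) * gini([3, 3])    # ~0.429
print(parent - children)                                        # "Gini gain" ~0.031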
Great sir
Man, you rock!
Thanks Krish
Thanks for the highly informative tutorials.
My question to you: is there any option in DecisionTreeClassifier in sklearn to make a node split into three child nodes when the feature used in splitting is categorically coded as (0, 1, 2) for the three categories?
How do we determine the leaf nodes? Better, how do we determine where to put the labels?
Great understanding, thank you.
Thank you so much sir, helped a lot.
I think f1 is divided into 8 yes / 6 no.
Another thing, initially we took f1 as root node and divided into f2 and f3. If this split gives highest information gain, then we will proceed for the next split of f2 and f3. Similarly, information gain will be calculated for next split for f2 and f3 by treating f2 and f3 as root nodes and the process goes on till we reach leaf nodes. Is this understanding correct? Please reply and correct me if I am wrong.
Yes it is correct.
Information gain is calculated at each feature level.
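A rough sketch of what "information gain is calculated at each feature level" looks like in code. This is a toy greedy selection over categorical features, not the actual library implementation; the feature names and data are made up:

import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return sum(-n / total * math.log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, feature):
    # group the labels by the value this feature takes in each row
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# toy data: two categorical features and a binary label
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain", "windy": "no"}, {"outlook": "rain", "windy": "yes"}]
labels = ["yes", "no", "yes", "no"]

# pick the feature with the highest information gain for this node,
# then recurse on each child subset until the nodes are pure enough
best = max(["outlook", "windy"], key=lambda f: info_gain(rows, labels, f))
print(best, info_gain(rows, labels, best))   # -> windy 1.0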
Thank you!
You are amazing!
great video. Keep going.
Kindly create playlist on computer vision
Thank you so much
Hi Krish, it is very helpful for understanding the famous paper "Induction of Decision Trees" by J.R. Quinlan. One question in my mind: do we need to convert the features into qualitative values? If yes, then we need to generate clusters too. If I'm right, then how do we decide the number of clusters, because my data is purely quantitative in nature?
You do not need to convert the features into numerical values if you are doing a classification problem. However, you will need to convert them into some sort of continuous values in regression.
Features will just be the nodes here on which we check the class of the samples. You do not need to transform it if you are just considering classification
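Not from the video, but on the question of purely quantitative data: as noted earlier in the thread, tree implementations usually don't need the numeric features binned into categories or clusters; they try candidate thresholds on each numeric feature and keep the one with the best information gain. A rough sketch with made-up numbers:

import math

def entropy(counts):
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def best_threshold(values, labels):
    # try a split "value <= t" for every candidate threshold t
    parent = entropy([labels.count("yes"), labels.count("no")])
    best = None
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        weighted = sum(len(side) / len(labels) *
                       entropy([side.count("yes"), side.count("no")])
                       for side in (left, right))
        gain = parent - weighted
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

# made-up numeric feature (say, temperature) with yes/no labels
print(best_threshold([64, 65, 68, 70, 71, 75], ["yes", "no", "yes", "yes", "no", "no"]))
# -> (70, ~0.459): splitting at "temperature <= 70" gives the highest information gain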
Thank you 😊
Sir, the H(S) value was 0.94, so why is there 0.91 in the formula?
I think he mistakenly wrote that; it should be 0.94.
Krish, why have you not applied the decision tree practically in Python?
Sir complete the RNN playlist please
Hi @Arpit next video is RNN only
but why would we start our tree from f2 when we know the entropy of f1 is smaller than the entropy of f2 ?
2:58 Sir Krish, does the symbol P mean probability?
Yes, it is probability.
Hey Krish, for the information gain, will it count all of the subsets until the leaf nodes? Let's say over here we want to find the information gain for f1, but f2 splits further into f4 and f5; then will the information gain be calculated based on f2 and f3 only, or will it also consider f4, f5 and f3?
It is calculated at each feature level.
Hi, thanks for the video. While explaining entropy in the beginning section, you said P+ and P- are the percentages of positive and negative values respectively. Is that correct? Should they be defined as the probabilities of positive and negative values rather than percentages?
Yes, it's the same.
Krish, Can you make some videos on PyTorch
Hey Krish, I have done pandas, matplotlib and seaborn. What should be next? Please help, I am confused.
Follow the playlist given in the link
6:54 Sir, what is the difference between the entropy of the class computed at the initial separation and the entropy of each attribute?
They are the same.
@@mohammedameen3249 Nope. I thought the class entropy is the entropy before the split, and the entropy of each attribute is the entropy after the split.
P+ is not the percentage of yes values; P+ is the fraction of yes values.
Hi Krish, your videos are good. Can you please make videos on different feature selection methods, i.e. filter, wrapper and embedded methods, together in detail? Thanks in advance.
Does a decision tree use a "one vs rest" mechanism for calculating entropy in multi-class classification?
The goat fr
Please show the implementation of a decision tree for any dataset.
Hello Sir, could you please do a project using logistic regression with strings?
By strings did you mean the categorical data?
Example of Categorical columns/features-
Gender(male,female,trans,null),
maritalstatus(married,notmarried,divorced,widowed,null)
If yes, then use encoding techniques to convert categorical variables into integers.
There are various techniques, and every technique has its own explanation and criteria for its use; figure out what will be a good fit for your case.
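A minimal sketch of one such encoding technique (one-hot encoding with pandas); the column names and values are just illustrative:

import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "maritalstatus": ["married", "notmarried", "divorced", "widowed"],
})

# one-hot encode both categorical columns into 0/1 integer columns
encoded = pd.get_dummies(df, columns=["gender", "maritalstatus"], dtype=int)
print(encoded.head())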