You are definitely going to change the lives of thousands of students who want to be Data Scientists. Thanks from the heart, sir. You inspired me a lot.
Thanks!
Man, you are too good. After spending my entire day looking for a solution on univariate analysis, I was about to go to bed when all of a sudden I came across your video, and guess what? I did it, man, thanks to you. Kudos, bro, you are a star.
Thank you sir, this session is very beneficial for us.
Much thanks. I learned a lot today!
You are great, sir; a very informative session conducted today 😊
Super, sir. I learned so much from your videos...
Starts at 8:00
Great Explanation sir!! Appreciate your effort!!
Very informative and interesting.
Thanks for clearing everything sir 😊
Excellent session
Better to add this video to the feature selection playlist.
Thank you sir
Very helpful information, thanks for creating the video.
Hello. For extremely large datasets like the CERT r4.2 dataset, which technique would you recommend for filling NaN/missing values? For example, for the 'activity' column?
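Two common options for a categorical column like 'activity' are mode imputation, or treating missingness as its own category (often informative on large behavioural datasets). A minimal pandas sketch; the file path here is just a placeholder:

import pandas as pd

df = pd.read_csv("cert_r4.2.csv")  # placeholder path, not the actual file name

# Option 1: mode imputation (fill with the most frequent value)
df["activity"] = df["activity"].fillna(df["activity"].mode()[0])

# Option 2 (alternative): keep missingness as its own category
df["activity"] = df["activity"].fillna("missing")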
very helpful
I am not able to find the continuation video of this session. Can anyone please help me?
Can you explain how you were able to use chi2 when it’s not categorical?
Exactly the same question I had. chi2 should only be used between categorical or discrete variables, not with continuous variables, but he used it with all of them.
Even I had the same question
Here is a simple modification of the correlation function; it returns the pairs of features that are highly correlated with each other.
def correlation(df, threshold):
    corr_features = set()  # a set stores only non-repeating items
    corr_matrix = df.corr()
    for i in range(len(corr_matrix.columns)):  # these two loops scan only one half of the matrix, since the other half is just its mirror
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:  # add feature pairs whose correlation exceeds the threshold
                corr_features.add((corr_matrix.columns[i], corr_matrix.columns[j]))  # names of the two columns
    return corr_features
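A quick usage example on a toy DataFrame; note that the abs() in the function already catches strong negative correlations as well:

import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
    "c": [5, 3, 8, 1, 2],
})
print(correlation(df, threshold=0.85))  # {('b', 'a')}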
Where is the second part of this video???
Is the second part of the video available anywhere?
Hello, in your previous videos you split the data into train and test and then apply the feature selection on X_train just to avoid overfitting, but here at 22 minutes, using SelectKBest for feature selection, you are taking the whole X. Why is that, sir? Anyone?
Is there any feature selection technique that can select significant input variables without converting non-numeric input columns to numeric?
Do we need to consider the threshold on the negative side also, e.g. less than -0.50?
Chi2 is used for hypothesis testing and also for feature selection.
Nice... Where is the 2nd part on feature selection?
Hi Krish, in mutual information feature selection, can we pass a feature with null values?
Sir, are these scores based on p-values, which actually estimate the relationship between an individual feature and the target variable (where the null hypothesis is no relationship and the alternative hypothesis is a relationship)?
Hello sir,
Which feature selection technique should be used for a high-dimensional dataset, e.g. one with 800 columns?
Hi Krish, we use the chi2 test on categorical variables, but here we are also using it on numerical variables. Can you please explain this?
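From what I understand, sklearn's chi2 only requires the feature values to be non-negative (ideally counts or one-hot categories), so it will run on non-negative numeric columns even though the chi-square interpretation is then shaky. A safer pattern is to bin continuous features first; a sketch on toy data:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))    # continuous features
y = (X[:, 0] > 0).astype(int)    # target driven by the first feature

# discretize into quantile bins so chi2 sees categorical-style indicators
binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
X_binned = binner.fit_transform(X)

scores, p_values = chi2(X_binned, y)
print(scores)  # one score per bin; the first feature's bins should dominate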
I tried SelectKBest for feature selection on the House Pricing dataset problem. As it had many categorical features converted into dummy variables, many columns were just dummy variables with no meaning on their own. Now, on applying SelectKBest, I am getting many dummy-variable columns. How can I proceed further, since including them as separate features makes no sense, but SelectKBest is indicating high importance?
In the feature selection playlist, tutorial 6 is missing. Please share.
Sir, amazing session, but I didn't get a notification.
Click on the bell icon and select "All"
Continue sir
What should be done first, feature selection or feature engineering?
Feature engineering and then feature selection
Feature engineering and then feature selection. Check the life cycle of a data science project if you are not sure.
Sir, for negative values SelectKBest (with chi2) does not work.
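Right, chi2 raises an error on negative inputs. Two common workarounds (a sketch on toy data, not from the video): rescale features into [0, 1] first, or switch to a score function that accepts negatives:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))      # contains negative values
y = rng.integers(0, 2, size=100)

# Workaround 1: rescale to [0, 1] so chi2 gets non-negative input
X_new = SelectKBest(chi2, k=3).fit_transform(MinMaxScaler().fit_transform(X), y)

# Workaround 2: f_classif (ANOVA F-test) handles negative values directly
X_new = SelectKBest(f_classif, k=3).fit_transform(X, y)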
Is feature importance similar to mutual information?
Doesn't the chi2 value range from 0 to 1? However, the score we get is something like 1477655.677. Can you please tell me what this score is?
same question
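The big number is the chi-square test statistic, which is unbounded; it is the p-value that lies between 0 and 1. sklearn returns both, so you can inspect them side by side; a small sketch:

from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)  # all features are non-negative
stats, p_values = chi2(X, y)
print(stats)     # unbounded statistics (this is the "score" shown)
print(p_values)  # probabilities in [0, 1]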
Hi Krish,
I am unable to find Live Feature Engineering Day 5 in your playlist... Would you mind sharing that?
What if we have categorical features?
I'm stuck on this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). Please help me.
Maybe you have missing values in your dataset.
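A quick way to locate and clean those values before running feature selection; a sketch assuming your data is in a pandas DataFrame named df:

import numpy as np
import pandas as pd

# df = your dataset
print(df.isna().sum())                        # count NaNs per column
df = df.replace([np.inf, -np.inf], np.nan)    # turn infinities into NaNs
df = df.fillna(df.median(numeric_only=True))  # e.g. median-impute numeric columns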
Please make a video on 1-way and 2-way ANOVA and its implementation in Python.
Is it always necessary to remove collinear or correlated features, or only in certain cases?
Only if independent features are highly correlated with each other; that means they are almost the same, so adding almost the same data will not help the model and won't make it more accurate. It's like telling the model the same thing again and again, which it already learnt from the previous data.
Krish, please make a video on Deep Learning Classifiers.
Please provide timestamps.. 🙏🏻
Should I ever log transform the target variable?
In a regression problem, should I transform the target variable?
It is generally not done; the constants or weights generally scale well to reach the target, so the benefit is not very evident, though nothing is a strict no in ML.
No, you can't.
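It can actually help when the target is heavily skewed (house prices, salaries, etc.), as long as you invert the transform when predicting. A sketch using sklearn's TransformedTargetRegressor on synthetic data; the choice of LinearRegression is just an assumption:

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
y = np.abs(y)  # make the target non-negative, like a price

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# log1p is applied to y during fit; expm1 is applied automatically to predictions
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X_train, y_train)
preds = model.predict(X_test)  # already back on the original scale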
Sir, please also make some content on multivariate analysis.
Sir, I can't join the Telegram channel, please help.
Install the Telegram app and search for "Discussion on ML and DL by Krish Naik".
@advocatesanthoshreddy9524 Thank you 😊 I have joined.
Sir, I have a question... In individual videos you do a train/test split and compute the feature selection technique on X_train, and
then remove the correlated features from X_train and from X_test too. Sometimes you don't do a train/test split and directly use df for feature selection. So I am confused: should I do the train/test split first and then do the feature selection, or just use df for it? Please reply,
sir. I am so confused about this. If anyone knows the answer, please reply too.
Thanks in advance. 🙏🙏🙏
We can do feature selection after the split or before the split, but it can cause an overfitting problem when we select on the whole df. So to prevent it, we first split and apply feature selection on X_train, and then, once you have the features, keep the same features for X_test as well.
If I am doing it wrong, please guide me.
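That is the standard pattern. A minimal sketch: fit the selector on X_train only, then reuse the same fitted selector (and hence the same features) on X_test:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)  # fit on train only
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)  # same columns kept, no refitting on test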
Just a friendly reminder, brother: do not learn things by hard-coding them. First analyze why you need to apply a certain algorithm to a dataset, then check whether there is really any need to apply that algorithm to that dataset at all.
Ask yourself some questions like:
Can't we achieve the results without applying it?
Is there another, more optimized algorithm for the same task?
What would be the merits and demerits of applying it to our dataset?
Only after answering all of those questions should you make your decision. The question you are asking is based entirely on this principle: it is not necessarily true that you should apply feature selection or feature engineering before or after the splitting. You should know that the size of the data will vary your correlation scores. If you have significantly less data, you should not even think of splitting, even at the time of modelling, because every algorithm in machine learning depends on the size of the data, and if you split data that is already small, you only increase the chances of overfitting.
Understand your question like this: you are in a hypothetical situation where you have a bag containing 50,000 balls, your task is to paint all of them blue, and you have a machine that can do that for you. Now, as I said above, before doing anything, analyze the problem and ask some questions:
What kind of balls are they, and does the kind of ball even matter?
Are there any burst balls present? Painting those would not be valuable and would consume time and resources.
Is any ball already painted with that color?
What if the color you are using is not of good quality, say it weakens after the ball comes into contact with water?
You have all the answers, and now you proceed to the next step.
(The next part should answer your question.)
You split the balls into groups of 40,000 and 10,000, but the task is still there: you still have to paint all 50,000 balls and at the same time check for all the anomalies present. Now it is up to you whether you do the procedure first on the 40,000 balls and then on the 10,000, or apply the same procedure in one go.
TL;DR
Your task is to do the feature selection, and it doesn't matter whether you do it after splitting or before splitting; the point is that you have to apply it consistently across the whole dataset. Let's say a feature, after splitting, has a very low correlation with the target variable in the training set: you would then definitely have to remove it from the test set as well, even if it shows a good correlation there, otherwise the two parts will look different. Or you can first remove the unwanted features and then do the splitting; in both cases you are doing the same thing.
Hope that helps. If there is anything in my answer you were not able to understand, do ask me. 🙂🙂
Hi sir
Hello sir... your videos are great. I need a small help from you. I am a BE IT student. Can you suggest a project idea for my final year? My domain is machine learning and deep learning, and I want to use some computer vision as well. It would be a great help if you suggest a topic. Thanks in advance.