Tutorial 11 - Exploratory Data Analysis (EDA) of Titanic dataset

  • Published 5 Oct 2024
  • Here is a detailed explanation of Exploratory Data Analysis of the Titanic dataset. At the end, we apply Logistic Regression to predict the Survived column.
    Github url: github.com/kri...
    References: Jose Portilla's EDA materials and Kaggle
    Stats playlist : • Population vs Sample i...
    You can buy my book, where I provide a detailed explanation of how to use Machine Learning and Deep Learning in Finance using Python
    Packt url : prod.packtpub....
    Amazon url: www.amazon.com...

COMMENTS • 288

  • @aakritiroy7336
    @aakritiroy7336 4 years ago +71

    After so much struggle with my LMS, I was finally able to understand the entire EDA within 30 minutes. Thank you.🙏👍

  • @VVV-wx3ui
    @VVV-wx3ui 5 years ago +24

    Doing a job that of True Guru, Ekalavyas are all around and raring for such knowledge-impartation. Thanks much Krish.

  • @Esha25ghosh
    @Esha25ghosh 4 years ago +14

    You are awesome sir! Not only are you a great mentor, but also a great motivator. Thanks for all the great work you have been doing. Stay blessed!

    • @chaos8514
      @chaos8514 2 years ago

      I am learning this for data analyst but not sure what more should I learn to get job asap.. if you can help please we can connect on instagram

  • @classicemmaeasy2292
    @classicemmaeasy2292 2 years ago +1

    Me trying to understand data analysis with python couple of days ago now
    U actually make it simpler and beginners friendly, more unction to function sir

  • @sowjanyadharmavarapu2653
    @sowjanyadharmavarapu2653 3 years ago +10

    Sir, I really liked your video, but according to the roadmap video, you asked us to watch Python lectures 1-24 first. In this EDA video, you mention some new terms like get_dummies and a few others, so I got stuck on the last 10 minutes of the explanation. Everything else is really clear and understandable. Thanks for all the effort.

    • @dynamictechnocrat
      @dynamictechnocrat 1 year ago

      get_dummies is a pandas function.

    • @ashridas9896
      @ashridas9896 1 year ago

      It is basically one-hot encoding.
      Encoding techniques are used to convert categorical data into numerical data.
      Here it is applied to the 'Embarked' column.
      ua-cam.com/video/OTPz5plKb40/v-deo.html
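
A minimal sketch of the one-hot encoding described in this reply, using pandas `get_dummies`. The tiny DataFrame is a made-up stand-in for the Titanic `Embarked` column, and `drop_first=True` is shown only as a common option, not necessarily the exact call used in the video.

```python
import pandas as pd

# Toy stand-in for the Titanic 'Embarked' column (values are illustrative).
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One-hot encode: one 0/1 indicator column per category (C, Q, S).
embarked_dummies = pd.get_dummies(df["Embarked"], dtype=int)

# drop_first=True removes the first category; it is implied when all
# remaining indicators are 0, so no information is lost.
embarked_reduced = pd.get_dummies(df["Embarked"], drop_first=True, dtype=int)

print(embarked_dummies)
print(embarked_reduced)
```

The encoded columns can then be joined back onto the original frame with `pd.concat([df, embarked_reduced], axis=1)` before the text column is dropped.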

  • @aayushshukla342
    @aayushshukla342 2 months ago

    Loved the video; in fact, the entire playlist gives an amazing approach to the intricacies of Machine Learning. Thank you, Sir.

  • @aination7302
    @aination7302 4 years ago +9

    Neither imputing nor dropping missing values (NaN) is good practice with real-world data. The ideal way is to derive a new field indicating missing values: 1 for missing, else 0, because sometimes a missing value can be new information in itself.
    Just sharing some learning from my job :)

    • @okonvictor8711
      @okonvictor8711 2 years ago

      Hi please do you mind sharing how to do that here. Or can I reach you via email?

    • @waqarmehdi4394
      @waqarmehdi4394 2 years ago

      Yes, it depends upon the dataset and problem you want to solve. In this case, dropping the null value is the best possible option in my opinion.
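
The missing-value indicator suggested at the top of this thread could be sketched as follows. The column names `Age_missing` and `Age_filled` are hypothetical, and filling with the median afterwards is just one possible choice, not something the commenter prescribed.

```python
import numpy as np
import pandas as pd

# Toy data: two of the four ages are missing.
df = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan]})

# New field: 1 where Age was missing, 0 otherwise.
df["Age_missing"] = df["Age"].isna().astype(int)

# Optionally fill the gaps as well; the indicator keeps the information
# that these values were originally absent.
df["Age_filled"] = df["Age"].fillna(df["Age"].median())

print(df)
```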

  • @souvikdas3905
    @souvikdas3905 4 years ago +3

    What a beautiful video for a beginner who is just getting his hands on data science.

  • @thePrabhuChannel
    @thePrabhuChannel 4 years ago +30

    21:30 The median age of the passengers travelling in each Pclass can be calculated with the code below instead of looking at the boxplot and guessing the number.
    df[df['Pclass']==1]['Age'].median()
    df[df['Pclass']==2]['Age'].median()
    df[df['Pclass']==3]['Age'].median()
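
The three lines above can also be collapsed into a single `groupby` call. A small sketch on made-up data (the ages are illustrative, chosen so the medians are easy to check by hand):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; NaN marks a missing age.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Age": [38.0, 36.0, 30.0, 28.0, 22.0, np.nan, 26.0],
})

# Median age per passenger class in one step; NaN values are skipped.
median_age_by_class = df.groupby("Pclass")["Age"].median()

print(median_age_by_class)
```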

    • @viveksingh881
      @viveksingh881 3 years ago +2

      good one brother i was thinking the same y to guess it when we can actually calculate it,....

    • @tusharmahuri2439
      @tusharmahuri2439 3 years ago

      I get an error when I use sns.countplot. The error is "could not interpret input 'survived'".

    • @yashikaarora8573
      @yashikaarora8573 2 years ago +1

      @@tusharmahuri2439 bro copy the heads from the data set and not just type, the language is case sensitive
      it is 'Survived' and not 'survived'

  • @aliakbarrayhan6389
    @aliakbarrayhan6389 4 years ago +3

    Sir I'm very impressed to see your such amazing video.. Though I am very weak in programming but now I feel like that i should start my programming journey again cause i have someone like u who can explains anything in very simple way

  • @sunnychandra5064
    @sunnychandra5064 5 years ago +5

    You have actually cleared the EDA concept for me, Thanks a lot !!

    • @ShivamChaudhary-jn4kw
      @ShivamChaudhary-jn4kw 8 months ago

      Why are 0 and 1 used to index cols when the column indices of Pclass and Age are 2 and 5? Can you clarify?
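
One likely explanation, sketched under the assumption that the function in the video is applied to a two-column selection such as `train[['Age', 'Pclass']]`: the function only ever sees those two columns, so their positions are 0 and 1 regardless of where they sit in the full frame. The toy frame and the helper `pick_pclass` below are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with the same leading column order as the Titanic training data.
train = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Age": [np.nan, 38.0],
})

# In the full frame, Pclass and Age sit at positions 2 and 5.
pclass_pos = train.columns.get_loc("Pclass")
age_pos = train.columns.get_loc("Age")

def pick_pclass(cols):
    # cols is one row of the two-column selection below,
    # so position 0 is Age and position 1 is Pclass.
    return int(cols.iloc[1])

pclass_seen = train[["Age", "Pclass"]].apply(pick_pclass, axis=1)

print(pclass_pos, age_pos, pclass_seen.tolist())
```

Using `.iloc` makes the positional lookup explicit; plain `cols[0]` on a label-indexed row is deprecated in recent pandas versions.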

  • @sudeeprajput1830
    @sudeeprajput1830 3 years ago +1

    You are amazing brother. Your videos are helping me gain confidence in ML. Keep up the good work

  • @VengalraoPachavaedu
    @VengalraoPachavaedu 5 years ago +3

    I have seen some of your videos, excellent work. I really appreciate your work Mr. Krish Naik.

  • @imranullah7355
    @imranullah7355 3 years ago +1

    Thanks a lot Sir... You've explained it in a great way... Love from Pakistan

  • @PiyushSingh-cq2xv
    @PiyushSingh-cq2xv 3 years ago

    This is one of the best data set being used to understand how to fix the nulls. Great Job and thank you .

  • @vital4statistix
    @vital4statistix 3 years ago

    Krish, This material is FIRST CLASS. Appreciate it very much.

  • @ManishKumar-gg2vm
    @ManishKumar-gg2vm 5 years ago +6

    Awesome explanation... I really can't stop myself from commenting on this video... one of the great videos on data visualization

  • @MrKmdmustaq
    @MrKmdmustaq 5 years ago +6

    Can u please make a video on treating the outliers, this will help us a lot in solving the problems

  • @akanshabhandari1062
    @akanshabhandari1062 3 years ago

    Very helpful... You did a lot of hard work for us... Thank you so much sir🙌🙌🙏🙏... And your way of teaching is very good because it starts from the basics

  • @sulaimankhan8033
    @sulaimankhan8033 3 years ago

    Krish - Thank you for the EDA.
    Please throw some light on storytelling. If you had to conclude the EDA theoretically, in layman's terms, we must do the storytelling. Correct me if I am wrong.

  • @tumul1474
    @tumul1474 5 years ago +1

    this is beyond amazing....amazing place to learn and to revise the important techniques

  • @naveenrawat6505
    @naveenrawat6505 3 years ago

    great video :)
    i have a suggestion
    we can drop PassengerId to increase the accuracy score because it doesn't contribute to the dependent variable

    • @tusharmahuri2439
      @tusharmahuri2439 3 years ago

      @naveen rawat
      I get an error when I use sns.countplot. The error is "could not interpret input 'survived'".

    • @naveenrawat6505
      @naveenrawat6505 3 years ago

      @@tusharmahuri2439 show me the line of code

  • @RajatSharma-ct6ie
    @RajatSharma-ct6ie 5 years ago +1

    Great work sir, learning a lot from your videos, please upload more videos on EDA..

  • @diprajkadlag
    @diprajkadlag 2 years ago

    one note, in boxplot the middle line inside the box is median value, not the mean value

  • @premkishanmishra1574
    @premkishanmishra1574 9 months ago

    loved your video , far better than the uni teachers :P

  • @girishmahamuni1830
    @girishmahamuni1830 3 years ago

    Thank you for providing knowledge in a simple way.

  • @theayodejipopshow
    @theayodejipopshow 1 year ago

    This video is amazing. Thanks so much for sharing your wealth of knowledge.

  • @pravinmore434
    @pravinmore434 4 years ago

    Thanks a lot for the very detailed lesson Sir.. that was really fruitful and helped me complete one of my project. Thanks a ton..

  • @subhamsaha2235
    @subhamsaha2235 3 years ago

    One correction, Sir: in the boxplot, the middle line is the median (50th percentile). Thank you

  • @muhammadbilalanwar6429
    @muhammadbilalanwar6429 4 years ago +1

    A very good video about EDA, but one thing I must mention is that you didn't even touch the concept of outliers. It's a major part of EDA, and honestly I watched this video only for outliers, but didn't find them.

  • @vinothv8514
    @vinothv8514 5 years ago +2

    Nice work Mr. Krish...... It's really helpful

  • @garvitjain4106
    @garvitjain4106 3 years ago

    @Krish You are doing an amazing job.

  • @rupeshnandanyadav8108
    @rupeshnandanyadav8108 2 years ago

    Awesome tutorial on Exploratory Data Analysis ❤️❤️

  • @mssnal
    @mssnal 3 years ago

    Great one Krish. Basically covers most of the things a beginner needs to understand.

  • @brainfuck007
    @brainfuck007 4 years ago +9

    You are a gem! Making india learn ML. Thank you for all the stuff you do for us. :)

  • @buzzfeedRED
    @buzzfeedRED 8 months ago

    @Krish : Arrange your Complete ML playlist videos into a roadmap playlist, from start to end : to data scientist

  • @GauravVerma-jk6cf
    @GauravVerma-jk6cf 3 years ago

    This was really some of the most useful stuff available!

  • @yashaskumargb3827
    @yashaskumargb3827 2 years ago

    Sir, the playlist is the best.
    But please share the link you downloaded the dataset from for every video,
    so that we can do what you explained in the video.

  • @pepetisiddhardha9848
    @pepetisiddhardha9848 3 years ago +2

    I didn't understand why the categorical features disappeared from the training data for logistic regression

  • @jagadeeshabburi570
    @jagadeeshabburi570 3 years ago

    kind of fantastic video bro, but it needs 2-3x watch for crystal clear understanding.

  • @KimJennie-fl3sg
    @KimJennie-fl3sg 4 years ago +5

    20:20 hey, uhmm.. 50% percentile gives us MEDIAN of the age of people with 1st class... So we are using MEDIAN value instead of MEAN right?
    Very helpful video for me to understand EDA

    • @sharathkumar8422
      @sharathkumar8422 4 years ago

      You're right, 50%ile is the median. I think you should check out the definition of median and percentiles on this page - www.statisticshowto.com/probability-and-statistics/percentiles-rank-range/#:~:text=The%2050th%20percentile%20is%20generally,quartiles%20is%20the%20interquartile%20range.
      That should clear your doubt.
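
A quick numeric check of the point made in this thread, on made-up ages: the 50th percentile equals the median, and both can differ noticeably from the mean when the data are skewed.

```python
import numpy as np

# Illustrative ages with one very old passenger pulling the mean upward.
ages = np.array([2, 20, 24, 28, 71])

median_age = np.median(ages)     # middle value of the sorted data
p50 = np.percentile(ages, 50)    # 50th percentile, same as the median
mean_age = ages.mean()           # arithmetic average

print(median_age, p50, mean_age)
```

The line inside a boxplot is drawn at `median_age`, not at `mean_age`.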

  • @babupatil2416
    @babupatil2416 4 years ago +1

    Hi Krish,
    Please create some more videos on EDA, it will be helpful.

  • @saylisuryawanshi3989
    @saylisuryawanshi3989 4 years ago

    great job sir, please do make more such videos for practising for beginners .

  • @gkmadhav
    @gkmadhav 3 years ago +4

    Is there a part 2 and 3 for this video, about feature engineering on the same dataset?

  • @umeshrbaidya
    @umeshrbaidya 4 years ago +4

    Great video Sir. I just have two doubts: first, why did you not use get_dummies on "Pclass", as it is also categorical data? Second, why did you not normalize the "Fare" and "Age" columns, as their values might overpower the results?

    • @bharathb3946
      @bharathb3946 4 years ago +1

      Same doubt bro

    • @harshmakwana8001
      @harshmakwana8001 4 years ago

      If you type "train.info()" you will see the dtypes of all the columns. I don't know if this helps, but I think get_dummies() can be used for object columns only: they do not represent any numerical value for the system to compute, so get_dummies() converts those objects into numerical values. Please correct me if I am wrong, as I am also confused about this; if you agree or have a different insight, please tell me.
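
On the question raised in this thread: by default `pd.get_dummies` encodes only object, string, or category columns and passes numeric columns such as `Pclass` through unchanged, but a numeric column can still be encoded by naming it in the `columns` argument. A small sketch on made-up data:

```python
import pandas as pd

# Toy frame: Pclass is an integer column, Sex is a text column.
train = pd.DataFrame({"Pclass": [1, 3, 2], "Sex": ["male", "female", "male"]})

# Default behaviour: only the text column is encoded, Pclass is left as-is.
default_encoded = pd.get_dummies(train, dtype=int)

# Naming the column explicitly one-hot encodes the integer Pclass instead.
pclass_encoded = pd.get_dummies(train, columns=["Pclass"], dtype=int)

print(default_encoded.columns.tolist())
print(pclass_encoded.columns.tolist())
```

Whether to encode `Pclass` or keep it as an ordered number is a modelling choice; this only shows that both are possible.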

  • @vinayaksharma6349
    @vinayaksharma6349 4 years ago +8

    Sir, how did you get to know that age is related to Pclass (how, and which analysis did you do)?

    • @ashishmeher216
      @ashishmeher216 4 years ago

      @Vinayak sharma you can relate any column with any other column.

    • @SravanKumar-td5im
      @SravanKumar-td5im 3 years ago

      You could do a heat map of all features and get their correlation according to which you can know which feature is dependent on what
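
The correlation idea from this reply can be sketched with `DataFrame.corr`. The numbers below are invented for illustration and are not Titanic values; drawing the heat map itself would additionally need seaborn, so it is only mentioned in a comment.

```python
import pandas as pd

# Made-up numeric data in which higher classes (1 = first) are older and pay more.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, 36.0, 30.0, 28.0, 22.0, 24.0],
    "Fare": [80.0, 70.0, 20.0, 25.0, 8.0, 7.0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# A strongly negative value suggests Age falls as the Pclass number rises.
age_pclass_corr = corr.loc["Age", "Pclass"]

print(corr.round(2))
# With seaborn installed, sns.heatmap(corr, annot=True) would plot this matrix.
```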

  • @honey9111
    @honey9111 4 years ago +1

    Thanks a lot Kris. EDA was well explained. I could not understand the last part starting from confusion matrix and how to read the final result of the analysis?

  • @AshishRoy
    @AshishRoy 2 years ago

    Very nicely explained. Awesome

  • @MrDeeb00
    @MrDeeb00 10 months ago

    Hi, Enable auto subtitle, It helps a lot.
    Thank you.

  • @venkatadeviprasadkankanala7387
    @venkatadeviprasadkankanala7387 4 years ago

    Very nice one thank you very much for sharing valuable information

  • @samudragupta719
    @samudragupta719 5 years ago +5

    Sir One question always revolves always in my mind that how should we remember all the libraries and syntaxes that are needed to Preprocess the data or doing the visualization stuffs??! It would be grateful if you share your strategies regarding that?!

  • @sung3898
    @sung3898 4 years ago

    The middle line in box plot is not average but it's a median.

  • @ifhamaslam9088
    @ifhamaslam9088 4 years ago

    Superb explanations..
    And interesting to learning

  • @samyakkumarsahoo8706
    @samyakkumarsahoo8706 3 years ago

    It was a resourceful video.
    But why EDA is done before train-test split ?

  • @arpitkakkar2780
    @arpitkakkar2780 3 years ago +1

    I think there is no need for "passengerId" to be included in the model. It should be dropped as well.
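
Dropping the identifier before fitting, as suggested here, is a one-line change. A minimal sketch on a made-up frame; the names `X` and `y` are only the usual convention for features and target, not taken from the video.

```python
import pandas as pd

# Toy frame: PassengerId is just a row label with no predictive meaning.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "Survived": [0, 1, 1],
})

# Features: everything except the identifier and the target column.
X = train.drop(columns=["PassengerId", "Survived"])
# Target: the column the model should predict.
y = train["Survived"]

print(X.columns.tolist(), y.tolist())
```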

  • @lavanyameesa6432
    @lavanyameesa6432 3 years ago

    wonderful explaination

  • @mohammadbariyawala2420
    @mohammadbariyawala2420 3 years ago +1

    If you are not able to import logistic regression with "from sklearn.model_selection import LogisticRegression", then try "from sklearn.linear_model import LogisticRegression". LogisticRegression lives in sklearn.linear_model, so this will work.

  • @shubhamthapa7586
    @shubhamthapa7586 4 years ago +1

    I have a question: why is he not using the SimpleImputer class from scikit-learn
    instead of finding a relation to fill in the NaN values?
    We can easily do it through the sklearn module.
    Also, why isn't he using a label encoder for the binary values?

  • @tusharikajoshi8410
    @tusharikajoshi8410 1 year ago +1

    Hey @Krish! Should we do this data visualization for each and every column, or do we do it after feature selection? If we are supposed to do it for each column, wouldn't the code get too big and complex for data with hundreds or thousands of features?

  • @aasthasingh67
    @aasthasingh67 3 years ago +1

    How do you know for one kind of result, which plot to use exactly?

  • @abhinavmahajan448
    @abhinavmahajan448 3 years ago

    Thanks for the detailed video. Really helpful :)

  • @saifkhan4541
    @saifkhan4541 4 years ago

    Thankyou sir it is very helpful 😊.

  • @mohamedshathik8045
    @mohamedshathik8045 2 years ago

    Hi Krish,
    You didn't drop the PassengerId column before fitting the logistic regression model, even though it doesn't contain any information.

  • @pandian3731
    @pandian3731 4 years ago

    Another great video very useful one bro like NLP.. 📍

  • @naveenrawat6505
    @naveenrawat6505 3 years ago

    loving the playlist :)))))

  • @adeniyi5875
    @adeniyi5875 1 year ago

    I like the video, but how did you know exactly the graphical representation to use, i mean why countplot why not jointplot? Why line plot not boxplot?
    I hope you really understand my questions sir

  • @ashishgoyal7020
    @ashishgoyal7020 3 years ago

    Thank you Krish.

  • @ganeshrao405
    @ganeshrao405 3 years ago

    Really helpful, Thank you soo much.

  • @pedrocrespo2681
    @pedrocrespo2681 3 years ago

    Pretty nice explanation !

  • @rishabhnegi1937
    @rishabhnegi1937 2 years ago

    wish..... Jack and Rose could also see this data analysis

  • @mustafaraza6107
    @mustafaraza6107 1 month ago +1

    16:15 seaborn now has displot() (without the "t"), which replaces the deprecated distplot()

  • @abhishekts740
    @abhishekts740 11 months ago

    Please upload video related time series analysis

  • @adityakhullar3735
    @adityakhullar3735 5 years ago

    Awesome. Just one thing: you haven't removed PassengerId from the training dataset; otherwise the accuracy of the model would have been 80%, which is good.

    • @akshaygoyal2134
      @akshaygoyal2134 4 years ago

      Are you sure, because I got only 0.75119 scores on doing the same thing shown in the video and removing the passengerId

  • @shaikhanuman8012
    @shaikhanuman8012 4 years ago

    Brother, can you clearly explain what the dependent and independent features are and how we solve the problem?

  • @unnatiraut9553
    @unnatiraut9553 2 years ago

    Great to understand. thanks alot

  • @dipeshlimaje8998
    @dipeshlimaje8998 2 years ago

    Sir, I'm confused: we are predicting survival, which is 0 or 1, so it is categorical data, yet we are solving it with regression?
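
A short sketch of why logistic regression, despite its name, is used for classification: it passes a linear score through the logistic (sigmoid) function to get a probability between 0 and 1, then thresholds that probability into a class. The scores and the 0.5 threshold below are illustrative assumptions, not values from the video.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Turn the probability into a 0/1 label (1 = survived)."""
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical linear scores (weights . features + bias) for three passengers.
scores = (-2.0, 0.0, 2.0)
probabilities = [sigmoid(z) for z in scores]
classes = [predict_class(z) for z in scores]

print(probabilities, classes)
```

The "regression" part refers to fitting the weights of the linear score; the final output is still a category.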

  • @bipulnath4602
    @bipulnath4602 5 years ago

    if possible upload more video on EDA..thanks Krish

  • @devanshusharma9386
    @devanshusharma9386 5 years ago

    very helpful for beginners

  • @121horaa
    @121horaa 4 years ago +1

    Sir, I didn't get why you compensated the missing value of age with the average age of Pclass?
    Can't we simply replace the NaN values with the median values of the age column as: train['Age']=train['Age'].fillna(train['Age'].median())

    • @glenn8781
      @glenn8781 3 years ago +2

      In practical reality, every person has an age value, but that data is missing for some people in the Titanic dataset. Our goal is not just to fill in any random age where the age is missing but to fill in an educated guess/estimate of the missing age of a person so that it can be a close representative of the true ages of those people. Of course, like you mentioned, the median of the entire age column could be used as an estimate, but would that be a good representative value for ALL missing ages? Some people would have ages far above or below the median age. So on further exploration we notice that the median age for each passenger class is different, which would mean that in reality, people from a certain Pclass would more likely be of a certain age than someone who belongs to another Pclass. And this difference is considerable (37 vs 29 vs 24). So by using Pclass to estimate age, we're just making a more educated guess for missing age values. You could of course go several steps further and consider other factors (like maybe SibSp, Parch etc.) in order to get a higher-probability age value.
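
The class-wise imputation described in this reply can be sketched with `groupby(...).transform`. The toy ages are chosen so that the class medians come out as 37, 29 and 24, matching the figures quoted above; this is one way to do it, not necessarily the function used in the video.

```python
import numpy as np
import pandas as pd

# Toy data: one missing age per passenger class.
train = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Age": [38.0, 36.0, np.nan, 30.0, 28.0, np.nan, 22.0, 26.0, np.nan],
})

# For every row, the median age of that row's own passenger class.
class_median = train.groupby("Pclass")["Age"].transform("median")

# Fill each missing age with the median of its class rather than the global median.
train["Age"] = train["Age"].fillna(class_median)

print(train)
```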

  • @hemapriyaelumalaipalani3752
    @hemapriyaelumalaipalani3752 4 years ago +1

    Great video Krish. One doubt- how did you find the correlation between pclass and age before creating the box plot?

    • @joelbraganza3819
      @joelbraganza3819 3 years ago

      Use an ANOVA test to check whether the mean of the continuous variable differs across the groups of the categorical variable.

  • @MuhammadAwais-n2b
    @MuhammadAwais-n2b 22 days ago

    3:37 Add hahahaha Great learning Exp love you brother

  • @maheshvenkat9956
    @maheshvenkat9956 4 years ago +1

    It could have been nice if def function had been explained in detail

  • @kkckk4360
    @kkckk4360 5 years ago +1

    Can you please make a video on hypothesis testing in stats?

  • @ShubhamJain-in6sz
    @ShubhamJain-in6sz 4 years ago

    Great work sir!!👍🏻👍🏻

  • @ds-hy9nc
    @ds-hy9nc 4 years ago +1

    When I try to apply my function (23:20), it shows "unexpected EOF while parsing"

  • @balajiabhi9039
    @balajiabhi9039 3 years ago

    @Krish Naik What is test size = 0.30, and why did you use it? From the beginning of the video everything was very good, but at the end I couldn't understand X train, y train, test size, what the accuracy of 0.7190 means, etc. Please tell me, sir, or else your efforts will go to waste...
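
A rough illustration of the terms asked about here. A test size of 0.30 means 30% of the rows are held out for evaluation and the remaining 70% are used for training; accuracy is the fraction of held-out predictions that match the true labels. The sketch uses pandas `sample` as a stand-in for scikit-learn's `train_test_split`, and the data, seed and predictions are made up.

```python
import pandas as pd

# Ten toy rows with a 0/1 target.
df = pd.DataFrame({"x": range(10), "Survived": [0, 1] * 5})

# Hold out 30% of the rows at random for testing; train on the rest.
test = df.sample(frac=0.30, random_state=101)
train = df.drop(test.index)

# Accuracy: share of predictions equal to the true labels.
y_true = pd.Series([0, 1, 1, 0])
y_pred = pd.Series([0, 1, 0, 0])   # hypothetical model output
accuracy = (y_true == y_pred).mean()

print(len(train), len(test), accuracy)
```

An accuracy such as the 0.7190 mentioned in the comment would mean roughly 72% of the held-out passengers were classified correctly.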

  • @RahulRoy-qy8rk
    @RahulRoy-qy8rk 4 years ago

    This was so helpful. Thank You

  • @josephtolentino1900
    @josephtolentino1900 3 years ago +2

    I wish I found this the first time around

  • @abdullahkidwai7222
    @abdullahkidwai7222 2 years ago

    I am getting key error after executing the following code:
    sns.distplot(train['Age'].dropna(),kde=False,color='darkred',bins=40)
    Any suggestion/idea as to what is to be done to stop getting this error?

  • @louerleseigneur4532
    @louerleseigneur4532 3 years ago

    Thanks Krish

  • @aradhyakanth8409
    @aradhyakanth8409 2 years ago

    Sir, what is the need to visualise the data in this problem? You haven't used any insight from the visualisation to help with data cleaning.

  • @ShivamVerma-gq2sm
    @ShivamVerma-gq2sm 4 years ago

    Sir, no concrete explanations were given during last 4mins and I found it very difficult to understand . I looked for those terms and tried to understand those from other videos and at last to complete this sole video I spent nearly 2 hours on other channels .
    You said that your tutorials on ML would give a comprehensive understanding of this subject. I am in doubt, if I should continue watching your tutorials or not?

  • @kalpatarusahoo1820
    @kalpatarusahoo1820 4 years ago

    Krish. Can you explain while data cleaning, why the passenger class is compared with Age and not any other columns. Big doubt of mine

  • @LearnwithNaviOfficial
    @LearnwithNaviOfficial 7 months ago

    @krish Naik we dropped the Age column, so how does the Age column appear again?

  • @classicgd
    @classicgd 4 years ago

    Hi Krish thanks for the videos... do you have a playlist explaining all algorithms ?

  • @gangasekar3224
    @gangasekar3224 2 years ago

    Instead of matplotlib and seaborn, can we use Power BI?

  • @Sab_Moh_Maya_Hal
    @Sab_Moh_Maya_Hal 4 years ago

    very knowledgeable,thanks man :)

  • @anahitasaxena9439
    @anahitasaxena9439 8 months ago

    why did you decide to analyse age with respect to Pclass in the missing value stage ?

  • @Parshant17
    @Parshant17 3 years ago

    Are you sure that is the average in the boxplot near the 20th minute? Because when we talk about percentiles, the 50th percentile should be the median.

  • @joelbraganza3819
    @joelbraganza3819 3 years ago

    Why do we need to get dummy variables for binary class variables like Sex and Embark, and why didn't we treat the variable pclass with One-hot-encoding, is it because we are treating it as ordinal, but wouldn't it cause problems with linear-regression and DNN algorithms to apply over it? Let me know Sir. Thanks.