🚀 Data Cleaning/Data Preprocessing Before Building a Model - A Comprehensive Guide

  • Published Feb 9, 2025
  • Welcome to Learn_with_Ankith! 📊 In this tutorial, we'll delve into the crucial steps of data preprocessing to ensure your datasets are in prime condition before feeding them into your machine learning models. A clean and well-prepared dataset is the foundation for accurate and reliable model predictions.
    Data_set link: www.kaggle.com...
    📌 Topics Covered:
    Import Necessary Libraries: Learn the essential libraries required for efficient data manipulation and analysis.
    Read File: Understand how to import data from various sources and formats into your Python environment.
    Sanity Check:
    Identify and handle missing values effectively.
    Explore the dataset's shape, information, and spot duplicates.
    Conduct a garbage check to maintain data integrity.
    Exploratory Data Analysis (EDA):
    Dive into descriptive statistics for a deeper understanding of your data.
    Visualize data distributions with histograms and box plots.
    Uncover patterns and relationships with scatter plots and correlation heatmaps.
    Missing Value Treatment:
    Implement strategies using mode, median, and KNNImputer to handle missing data.
    Outlier Treatment:
    Explore methods to detect and deal with outliers that can impact model performance.
    Encoding of Data:
    Convert categorical variables into a format suitable for machine learning algorithms.
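    The sanity-check, missing-value, and encoding topics above could look roughly like the following in pandas. This is only a sketch: the tiny DataFrame and its column names are made up for illustration and are not taken from the video's notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the dataset used in the video.
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "B"],
    "Status": ["Developing", "Developing", "Developed", None, "Developed"],
    "BMI": [18.5, 18.5, 25.1, np.nan, 60.0],
})

# Sanity check: shape, share of missing values per column, duplicate rows.
print(df.shape)
missing_pct = df.isnull().mean() * 100
n_duplicates = df.duplicated().sum()
print(missing_pct)
print("duplicate rows:", n_duplicates)

# Garbage check: list the distinct values of each text column.
for col in ["Country", "Status"]:
    print(col, df[col].unique())

# Drop exact duplicate rows.
clean = df.drop_duplicates().reset_index(drop=True)

# Missing value treatment: median for numeric, mode for categorical columns.
clean["BMI"] = clean["BMI"].fillna(clean["BMI"].median())
clean["Status"] = clean["Status"].fillna(clean["Status"].mode()[0])

# Encoding: one-hot encode the categorical columns.
encoded = pd.get_dummies(clean, columns=["Country", "Status"], drop_first=True)
print(encoded)
```

    Median and mode are used here because they are the simplest of the strategies listed; KNNImputer would slot into the same place.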
    🔧 Whether you're a beginner or seasoned data scientist, mastering these preprocessing techniques is fundamental for building robust and accurate machine learning models.
    #DataPreprocessing, #DataCleaning, #MachineLearning, #DataScience, #DataAnalysis, #PythonProgramming, #Tutorial, #ExploratoryDataAnalysis, #OutlierDetection, #MissingValueTreatment, #DataVisualization, #Programming, #DataManipulation, #CodingTips, #FeatureEngineering, #DataQuality, #Pandas, #NumPy, #Matplotlib, #Seaborn, #DataInsights, #TechTutorial, #DataEngineering, #MachineLearningModels, #AIProgramming, #DataAnalytics, #DataWrangling, #TechEducation, #PythonTips, #Statistics, #DataSkills, #ProgrammingLife, #Algorithm, #TechTalk, #CodingCommunity, #DataPrep, #CodeNewbie, #DataQualityCheck, #LearnDataScience, #ProgrammingJourney

COMMENTS • 90

  • @chimdiihenacho4985 2 days ago +1

    This is the best tutorial I have come across as a machine learning student. This has given me the entry I needed to get shit done. Thanks a lot, Ankith

  • @gloomyday4524 9 months ago +59

    You don't know how much this video helps clueless students like me. You did such a good thing, bro. I hope everything always goes easy in your life!

  • @mustaphakwari4536 11 days ago

    I am truly at a loss for words to express the value of this tutorial. It is incredibly insightful, educational, and highly informative. A perfect roadmap for beginners. My sincere appreciation to the presenter for such a fantastic session!

  • @ContehAlimamy-p1k 1 month ago +4

    You forgot one step, step 8: Normalization. Who else noticed in the video? Thank you so much for the video.

  • @amina._.1862 2 months ago +4

    I like the layout, very professional and shows exactly each process (what it is) step by step tysm

  • @kiruthickagp 1 year ago +4

    Very clearly explained

  • @nithyarajan7317 11 days ago

    So much detail and a good explanation, sir. Thank you so much for the video

  • @bhaskarmondal7461 1 year ago +2

    Thank you so much, sir,
    for providing this particular kind of tutorial, which is specifically targeted at machine learning rather than data analysis. Also, I had been looking for something just like this for the last few days

  • @AmahaGebretsadikan 10 months ago +1

    I like the organisation and contents of the presentation

  • @mitchellyula4447 6 months ago +3

    Thank you for this walkthrough. This will help me on my next project for school.

  • @bombasticiti 1 year ago +1

    Nice, Thank you for feeding my mind!🙂

  • @yasink18 8 months ago +5

    Thank you so much for making a simple video.
    Can you make more videos just on handling different types of outliers, and on how to understand which types of outliers we need to handle or ignore?

  • @AB51002 1 year ago +6

    Could you also make a video exploring and cleaning text data? Something like what LLMs train on, but obviously much smaller. Something like 1GB of text perhaps. I can't find any online resources targeting that specifically, and it could help many people learn how to better filter text dataset for higher quality datasets. Thank you in advance!

  • @Akash-us3mo 10 months ago +1

    Thank you

  • @FrancisNzubechukwu-qe6mj 1 month ago

    Thanks for the free lesson💌

  • @Vaishnavi31-b4z 3 months ago +1

    Awesome tutorial bro!! Thanks!!

  • @stefan5249 4 months ago

    Hi, well-structured tutorial. Systematic, for understanding what to do in a first data inspection. Thank you!

  • @rekharekha-ll4cp 2 months ago

    Excellent explanation, only now have I understood the preprocessing

  • @SonaliDey-e5u 4 months ago

    Thanks bro for your informative video. This video saved me from such a mess which I was not able to understand

  • @Dadepegba 3 months ago

    I love your lesson, you explain very clearly. Thank you.

  • @kelvinau3606 1 month ago

    Thanks for the video, brother, love it

  • @anurag17091977 9 months ago +2

    stupendous video. keep it up bro.

  • @vrishabhbhonde6899 9 months ago +2

    Thanks a lot sir. Very helpful and very clear steps

  • @minalgupta7456 16 days ago

    nice explanation

  • @thaisbraz9092 1 month ago

    Thank you so much for this video❤

  • @ayshafida3413 2 months ago

    Superb video

  • @Ciiads 5 months ago +1

    good job👌👌❤❤

  • @percidaman4409 9 months ago +1

    Thanks man this was so great, you really helped me

  • @AmaRan31 5 months ago

    Thanks a lot for this video!

  • @hey_hae 5 months ago

    very clear explanation thank u!

  • @Shahzaib_Aqeel 2 days ago

    @Ankith Kindly Share the notebook as well please.

  • @umerahmed7062 25 days ago

    While filling the missing values you also filled Life expectancy, although you previously said that Life expectancy shouldn't be touched. I think you have done the very thing you said to avoid

  • @mukhammedabusuveilim3468 2 months ago

    Overall a very good video. It would've been great if you had added specific sections for continuous and categorical data types. Another point: I don't understand why you showed the correlation matrix if you didn't use it to filter out highly correlated features (there are a couple that were fully correlated and, I assume, some that were highly correlated).

    • @learnwithankit383 2 months ago

      Thanks for the feedback. While this video focused on the initial cleaning steps, the correlation matrix is often crucial for feature selection.

  • @vaibhavchaganti1709 1 day ago

    Sir, why did you miss Normalization (step 8)?

  • @fasiowaizahmed4641 5 months ago

    great video !!

  • @alfredturkson1319 8 months ago +1

    How did you set up your jupyter notebook? the settings to make mine look like yours please

  • @AtomicPixels 11 months ago +1

    You can skip literally every step here by uploading your data to hugging face and opening the auto train data viewer tool that’s auto generated for you. It includes the answers to all of these problems already with no code or time spent making it a task you don’t need to be focused on

  • @adritaadi8027 15 days ago

    can i do this on kaggle? following the same steps?

  • @melissameeker3189 7 months ago

    Thank you so much you helped me understand

  • @abdulbasithassan1742 2 months ago

    good brother

  • @onlyguitars 1 year ago

    Hi! Great video, very helpful and love how each step is clearly outlined! Just a question. In the outliers why change the value to the UW and LW, and not just drop those rows? Thank you!
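    To make the two options in this question concrete, here is a minimal sketch comparing capping at the whiskers with dropping the rows. It assumes UW and LW mean the usual box-plot whiskers at 1.5 × IQR beyond the quartiles; the `whiskers` helper and the sample values are hypothetical.

```python
import pandas as pd


def whiskers(series: pd.Series) -> tuple[float, float]:
    """Return (lower, upper) box-plot whiskers using the 1.5 * IQR rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Made-up column with one obvious outlier.
df = pd.DataFrame({"value": [10.0, 12.0, 11.0, 13.0, 12.0, 100.0]})
lw, uw = whiskers(df["value"])

# Option 1: cap (winsorize) values outside the whiskers; no rows are lost.
capped = df.assign(value=df["value"].clip(lower=lw, upper=uw))

# Option 2: drop rows outside the whiskers; the dataset shrinks.
dropped = df[df["value"].between(lw, uw)]

print(len(capped), len(dropped))
```

    Capping keeps every row, which matters when the other columns of that row still carry useful information or when the dataset is small; dropping is simpler but discards the whole observation.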

  • @SalmanKhan-e9u3e 6 months ago

    Thank you so much sir

  • @mr.malluclasher 3 months ago +1

    sir.. do i need to fix the skewness before encoding and scaling?

  • @priyankakasturia 3 months ago

    Can I use interpolation instead of mean or median if I have time series data with missing numeric values?
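    pandas does offer interpolation for this situation. A small sketch, assuming a numeric series on a daily DatetimeIndex (the dates and values are invented):

```python
import numpy as np
import pandas as pd

# Made-up daily series with two gaps.
idx = pd.date_range("2024-01-01", periods=5, freq="D")
ts = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], index=idx)

# Linear interpolation treats the observations as equally spaced.
linear = ts.interpolate(method="linear")

# Time-based interpolation weights by the actual gap between timestamps,
# which matters when the index is irregularly spaced.
time_based = ts.interpolate(method="time")

print(linear)
```

    Unlike a global mean or median, interpolation uses the neighbouring observations, so it tends to suit series that change smoothly over time.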

  • @minalgupta7456 16 days ago

    upload more projects related to the data scientist

  • @Mission_Satyuga 8 months ago

    Nice video, thanks brother ❤

  • @nabinbk1065 8 months ago

    thank you sir. you are great

  • @md.sowmik7944 2 months ago

    Best

  • @khushboo4743 6 months ago

    Is there a video on a machine learning model for this data?

  • @aliloreno 3 months ago

    2:39 imports

  • @akhandsingh6497 6 months ago

    Thanks for this video. I want to ask how you get the run time shown in Jupyter Notebook. Please tell me

  • @Balaji-wb7cp 8 months ago

    Superb bro

  • @roshanbhattad4493 4 months ago

    Hi Ankith, thanks for the tutorial. I do have a question: can we do missing value treatment before EDA?

  • @mohitjoshi8984 1 year ago

    Hello,
    I need help with the correlation part: it is showing NaN and 0.0.
    Please help

  • @rekhamalik3663 1 year ago

    Amazing!
    Can you please make a video with complex JSON files, e.g. stock market data?

  • @RajenderKumarG 5 months ago

    At 34:30, BMI is not working. After replacing inplace = True with False and removing BMI, it works. Please help

    • @LaxmiKumari-ru8lu 4 months ago

      Same. Have you found any solution to this problem?

    • @asma5202 3 months ago

      Add a space before and after, like this: " BMI "
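      An alternative to typing the padded name is to strip whitespace from all column names once, right after loading. The sketch below uses a made-up frame whose headers carry stray spaces, on the assumption that this is what the reply above is pointing at.

```python
import pandas as pd

# Hypothetical frame whose headers carry leading/trailing spaces.
df = pd.DataFrame({
    " BMI ": [18.5, None, 25.1],
    "Life expectancy ": [65.0, 70.2, 59.9],
})
print(df.columns.tolist())  # the stray spaces are visible here

# Normalize the headers once so later code can use the clean names.
df.columns = df.columns.str.strip()

# Assign the result back instead of calling fillna(..., inplace=True)
# on a column selection, which may not update the DataFrame in recent
# pandas versions.
df["BMI"] = df["BMI"].fillna(df["BMI"].median())
print(df)
```

      Printing `df.columns.tolist()` first is a quick way to confirm whether a KeyError comes from hidden whitespace.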

  • @yasinimudy8688 9 months ago

    Nice video; however, I would like to know whether the ".fit_transform" method of KNNImputer causes data leakage when applied to fill null values.
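    The leakage concern arises when the imputer is fitted on the full dataset before splitting, because the test rows then influence the values filled into the training rows. One common way to avoid that, sketched here on synthetic data (the column names and split sizes are arbitrary), is to fit on the training split only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Synthetic data with some missing values in column "b".
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(20, 3)), columns=["a", "b", "c"])
X.iloc[::4, 1] = np.nan

# Split first, so the test rows stay unseen during fitting.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = KNNImputer(n_neighbors=3)

# fit_transform learns the neighbours from the training rows only.
X_train_filled = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)

# transform reuses those training neighbours to fill the test rows.
X_test_filled = pd.DataFrame(
    imputer.transform(X_test), columns=X.columns, index=X_test.index
)
```

    For pure exploration on a single dataset, as in the video, calling `fit_transform` on everything is harmless; the split-first pattern matters once a model is evaluated on held-out data.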

  • @muhammadsamir2243 8 months ago

    Please share the notebook link

  • @gayathrikrishnamoorty4243 9 months ago

    what will we do if we find duplicates in dataset??

  • @raghavendraraodk7855 8 months ago

    Sooper

  • @devanshupatnaik_video6387 8 months ago

    Is this a data cleaning method?

  • @pra438 4 months ago

    Please provide notes also

  • @bhushansonawane5915 7 months ago

    Hello sir, how can i connect with you ? Need urgent help please

  • @cryptofile4002 7 months ago

    @Learn with Ankith can you pls offer the code for this?

  • @DivyaChindam-h3p 7 months ago

    Normalization?

  • @jglez6868 1 month ago

    I want to add something. When you are dealing with missing values, let's say for the polio column, you should replace those values with the mean of polio for the corresponding country. If you take the overall mean, you might get a slightly different value than if you, say, find the mean of polio in Yemen and use that. So it's always good to think of ways to avoid generalizing too much and to fill in with more specific, realistic data
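    The per-country idea in this comment can be sketched with a pandas groupby. The numbers below are invented, and the column names are only assumed to resemble those in the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values; "Polio" has one gap per country.
df = pd.DataFrame({
    "Country": ["Yemen", "Yemen", "Yemen", "Chile", "Chile"],
    "Polio": [60.0, np.nan, 70.0, 95.0, np.nan],
})

# Overall mean: every gap gets the same value regardless of country.
overall = df["Polio"].fillna(df["Polio"].mean())

# Per-country mean: transform("mean") returns a value for every row,
# aligned with the original index, so it can be passed to fillna.
country_mean = df.groupby("Country")["Polio"].transform("mean")
per_country = df["Polio"].fillna(country_mean)

# Fallback for countries with no observed value at all.
per_country = per_country.fillna(df["Polio"].mean())

print(pd.DataFrame({"overall": overall, "per_country": per_country}))
```

    Here the Yemen gap is filled from the Yemen rows only, while the overall mean would pull it toward the other country's values.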

  • @amanagrawal1976 8 months ago +1

    Pls provide jupyter notebook code

  • @shanthalaxmikumar4931 4 months ago

    Thank you for this. Can you please tell us what to do in the case of date data?

  • @nguyenthiyenhuong2344 10 months ago

    where is Normalization? pls

  • @ayushjaiswal350 7 months ago

    okay video

  • @mayfield7835 6 months ago +1

    700th like

  • @poiuytrewq-d8e 10 months ago +1

    WORTH VARMA WORTH

  • @prabhatkumar-0145 1 year ago

    provide a csv file also

    • @learnwithankit383 1 year ago +1

      www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

  • @iizrael 8 months ago

    Please, how can I install pandas and the rest in my notebook? Mine shows an error when I try importing as you did

    • @learnwithankit383 8 months ago

      Try executing: !pip install pandas in Jupyter Notebook.

  • @lilaclove1709 9 months ago

    🙂

  • @bevg1 1 year ago

    slow down a bit...

  • @iShowSweat-iest 1 year ago

    Please add the code script next time

  • @jesuslopez6873 4 months ago

    always an Indian...