Getting a random sample from your pandas data frame

  • Published 26 Feb 2022
  • Working with Python's pandas library for data analytics? If your data set is very large, you might sometimes want to work with a random subset of it. The "sample" method is perfect for that. In this video, I demonstrate the ways in which you can use the "sample" method on your data frames to get back precisely the number (or fraction) of rows you want.
    Jupyter notebooks for all of my videos are at github.com/reuven/youTube-not....
    And don't forget to sign up for my free, weekly Python newsletter at BetterDevelopersWeekly.com/!
  • Science & Technology
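
As a rough illustration of the "number (or fraction) of rows" idea in the description, here is a minimal sketch. The data frame and its column names are made up for the example; `n`, `frac`, and `random_state` are parameters of `DataFrame.sample`.

```python
import pandas as pd

# Hypothetical data frame with 1,000 rows, standing in for a large data set.
df = pd.DataFrame({"id": range(1000), "value": [i * 2 for i in range(1000)]})

# Exactly 10 randomly chosen rows.
ten_rows = df.sample(n=10, random_state=42)

# 5% of the rows (50 of 1,000).
five_percent = df.sample(frac=0.05, random_state=42)

# random_state makes the draw reproducible; omit it for a fresh sample each run.
print(len(ten_rows), len(five_percent))
```

Passing both `n` and `frac` in the same call is an error, so pick whichever one matches how you think about the sample size.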

COMMENTS • 27

  • @madhurakhaire6583 2 years ago +1

    Very useful for my Master's data science dissertation, as I'm working with a tremendously large dataset. Thanks a lot!!!

  • @alaaeltayeb5794 2 years ago +1

    Very helpful, thank you so much! Your teaching skills are fantastic and smooth.

    • @ReuvenLerner 2 years ago

      Thanks so much for your kind words!

  • @biglicha 1 year ago +2

    Thanks! It was very useful for my homework.

  • @umarabdullah1697 2 years ago +2

    Nice explanation

  • @nadjagomes4854 1 year ago +1

    Thank you for sharing your knowledge! Is there a way to randomly choose just one variable from a specific column?

    • @ReuvenLerner 1 year ago

      My pleasure!
      When you ask for a random sample, you're getting a random row (or several random rows) from the data frame. When you say that you want "just one variable," what are you referring to -- a specific column?
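
If the question means "one random value from a single column", a sketch along these lines may help. That reading of the question is an assumption, and the column names are invented for the example.

```python
import pandas as pd

# Hypothetical data; 'name' and 'score' are made-up column names.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy", "Di"], "score": [81, 92, 77, 68]})

# Selecting one column gives a Series, which has its own sample method.
# Sampling one element returns a one-element Series; .iloc[0] extracts the bare value.
one_score = df["score"].sample(n=1, random_state=0).iloc[0]

print(one_score)
```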

  • @KakarotSsjg 2 years ago +1

    Hello there, I had a few doubts related to random sample generation (with some sampling logic (10%) that covers every unique user in the given data set), where I could then assign the generated samples to 'n' users. I know what I'm asking here is quite basic, but I couldn't find anything relevant. Can you kindly help? (This is basically for generating audit samples from a CSV file.)

    • @ReuvenLerner 1 year ago +1

      I'm afraid that I don't know much about the random-sample mechanism in Python. I assume that it's documented, and that you can choose which kind of random sampling you want to do... but that's about as far as my knowledge goes, I'm afraid!
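
One possible way to approach the per-user 10% sample from the question is to sample within each user's rows and insist on at least one row per user. This is only a sketch under assumptions: the `user` and `amount` columns and the `sample_per_user` helper are hypothetical, not something from the video.

```python
import pandas as pd

# Hypothetical audit log: 'user' and 'amount' are made-up column names.
df = pd.DataFrame({
    "user": ["a"] * 30 + ["b"] * 20 + ["c"] * 3,
    "amount": range(53),
})

def sample_per_user(frame, frac=0.10, seed=0):
    """Take about `frac` of each user's rows, but never fewer than one."""
    parts = []
    for _, group in frame.groupby("user"):
        n = max(1, round(len(group) * frac))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)

audit_sample = sample_per_user(df)
print(audit_sample["user"].value_counts().to_dict())
```

The `max(1, ...)` guard is the design choice that matters here: a plain 10% of a user with only three rows would round down to zero and drop that user from the audit.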

  • @oueslatinihel6071 2 years ago +1

    Thank you for this explanation, it was very helpful. But I need to ask: if I have a CSV file and I want to use exactly 1/4 of the dataset to train my model, and I don't want it to be random, what should I do? Thank you!

    • @ReuvenLerner 2 years ago

      Try reading the file in chunks (i.e., set chunksize when calling read_csv), and then stop after you've read one chunk. That seems like the easiest thing.
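
A minimal sketch of the "read one chunk and stop" suggestion follows. Note that the `read_csv` parameter is spelled `chunksize`. The CSV content is generated in memory for the example, and the total row count is assumed to be known in advance (or counted in a first pass over the file).

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk: a header plus 100 data rows.
csv_text = "x,y\n" + "\n".join(f"{i},{i * i}" for i in range(100))

total_rows = 100            # assumed known, or counted in a first pass
quarter = total_rows // 4

# Option 1: read in chunks and keep only the first chunk.
with pd.read_csv(io.StringIO(csv_text), chunksize=quarter) as reader:
    first_chunk = next(reader)

# Option 2: ask for the first rows directly with nrows.
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=quarter)

print(len(first_chunk), len(first_rows))
```

Both options return the first quarter of the file in its original order, so the result is only representative if the file is not sorted in some meaningful way.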

  • @monome3038 8 months ago +1

    thank youuuu!

  • @avibis6509 1 year ago +1

    Thank you for the information, sir. But how do I exclude values less than or equal to zero (a different kind of sample)?

    • @ReuvenLerner 1 year ago

      First, filter the rows, so that you end up with a data frame containing only those you want. But then you have a new data frame -- on which you can still ask for a random sample!
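
The filter-then-sample approach described in this reply could look roughly like this. The `reading` column is a made-up name for the example.

```python
import pandas as pd

# Hypothetical data with some zero and negative values.
df = pd.DataFrame({"reading": [-3, 0, 5, 12, -1, 7, 9, 0, 4, 2]})

# Keep only rows whose value is strictly greater than zero...
positive = df[df["reading"] > 0]

# ...then sample from the filtered data frame as usual.
picked = positive.sample(n=3, random_state=1)

print(picked)
```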

  • @kartik1396 2 years ago +1

    How does your file location autocomplete after you type ~ before courses?

    • @ReuvenLerner 2 years ago

      If you press the "tab" key in Jupyter, it tries to complete identifiers (i.e., variables, functions, and classes), attributes (after a dot) and filenames (in certain contexts). It doesn't always work perfectly, but it does tend to work pretty well.

  • @spaceadvanture6458 8 months ago +1

    Does the sample represent the actual population? I mean, if I train a model using a sample data set, will it also be correct for the actual population?
    Is it good practice to train a model on samples?

    • @ReuvenLerner 8 months ago

      When it comes to machine learning, you're always training on a sample of the data. However, you normally don't want a totally random sample, because you want to make sure that all of the different possibilities are taken into account. If I train my model on a random sample of people, it's possible that I'll only get men above the age of 70. Which means that the model will be broken for anyone outside of that demographic. For that reason, stratified sampling is usually better for models -- and there's a whole field of expertise (which isn't me!) that talks about how to build your sample so that it's truly representative and can be used to extrapolate to the general population.
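
To make the stratified-sampling idea concrete, here is one simple sketch using pandas' group-wise `sample`. The age groups and their sizes are invented for the example, and real projects often need more careful sample design than this.

```python
import pandas as pd

# Hypothetical population with a 60/30/10 split across three age groups.
df = pd.DataFrame({
    "group": ["under_40"] * 60 + ["40_to_70"] * 30 + ["over_70"] * 10,
    "value": range(100),
})

# Draw 20% from each group so the sample keeps the same 60/30/10 mix.
stratified = df.groupby("group").sample(frac=0.2, random_state=0)

print(stratified["group"].value_counts().to_dict())
```

A plain `df.sample(frac=0.2)` would get the mix right only on average; sampling within each group guarantees that every group is represented in proportion.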

  • @user-sp7wh5mw5k 4 months ago +1

    “In this world, no one teaches random sampling as clearly as you.”

    • @ReuvenLerner 4 months ago

      Thanks so much for your kind words!

  • @l8870 1 year ago +1

    Is there any way to prove that Python's random sampling is indeed random, from a statistical perspective?

    • @ReuvenLerner 1 year ago

      I'm sure that there is - but that's way beyond my expertise. I tend to trust the Python core developers, and how they implemented the "random" module. I'm sure the documentation describes what kind of random sampling they're doing.

    • @l8870 1 year ago

      @ReuvenLerner Hi, in statistics you can show that your random sampling method does indeed have good randomness by using the runs test (Wald-Wolfowitz). My professor just taught me this at the end of a homework discussion ...
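
For readers curious about the runs test mentioned here, the following is a rough sketch of the Wald-Wolfowitz statistic for runs above and below the median, using the usual normal approximation. The `runs_test_z` helper is hypothetical and written for this example. A z-score near zero is consistent with randomness but does not prove it, while a large absolute value suggests the ordering is not random.

```python
import math
import random

def runs_test_z(values):
    """Wald-Wolfowitz runs test: z-score for runs above/below the median."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2

    # Classify each value as above (True) or below (False) the median,
    # dropping values equal to the median.
    signs = [v > median for v in values if v != median]
    n1 = sum(signs)
    n2 = len(signs) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need values on both sides of the median")

    # A new run starts wherever the sign changes.
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))

    n = n1 + n2
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (runs - expected) / math.sqrt(variance)

rng = random.Random(0)
shuffled = [rng.random() for _ in range(200)]

z_random = runs_test_z(shuffled)          # pseudo-random order
z_sorted = runs_test_z(sorted(shuffled))  # same values, sorted: far too few runs

print(round(z_random, 2), round(z_sorted, 2))
```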

  • @atifdai313 1 month ago +1

    I am using yearly data. Suppose my data has 33 rows and 20 columns (the 20 columns include the years, 1999 to 2022) in my summary statistics analysis. How can I exclude the year column from my whole analysis? Or should I delete the year column? Please advise on any relevant data-shape command.

    • @ReuvenLerner 1 month ago

      You can remove one or more columns with df.drop. If you want to remove all rows in a particular range, then you will likely want to use a boolean index to indicate what you do or don't want, and then apply it to the data frame. There isn't room here to explain that, but look for my video about "boolean indexing made simple" that explains it more.
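
A short sketch of both suggestions in this reply, dropping a column with `df.drop` and selecting rows with a boolean index, might look like this. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical yearly data; 'year', 'gdp', and 'inflation' are made-up columns.
df = pd.DataFrame({
    "year": [1999, 2000, 2001, 2002],
    "gdp": [1.2, 1.5, 1.1, 1.8],
    "inflation": [2.0, 2.4, 3.1, 2.2],
})

# Summary statistics without the 'year' column; df itself is unchanged.
summary = df.drop(columns=["year"]).describe()

# Boolean index: keep only the rows from 2001 onward.
recent = df[df["year"] >= 2001]

print(summary.columns.tolist(), recent["year"].tolist())
```

Because `drop` returns a new data frame by default, there is no need to delete the year column from the original data just to leave it out of one analysis.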