Getting a random sample from your pandas data frame

Python and Pandas with Reuven Lerner

Published 10 Jan 2025

COMMENTS • 31

@madhurakhaire6583 2 years ago ⁺¹
Very useful for my Master's data science dissertation, as I'm working with a tremendously large dataset. Thanks a lot!!!
@ReuvenLerner 2 years ago
I'm delighted to hear it!
@alaaeltayeb5794 2 years ago ⁺¹
Very helpful, thank you so much. Your teaching skills are fantastic and smooth.
@ReuvenLerner 2 years ago
Thanks so much for your kind words!
@ImBatmanYT_CODM 2 years ago ⁺¹
Hello there, I have a few questions about random sample generation. I need some sampling logic (10%) that covers every unique user in the given data set, so that I can then assign the generated samples to 'n' users. I know what I'm asking is quite basic, but I couldn't find anything relevant. Can you kindly help? (This is basically for generating an audit sample from a CSV file.)
@ReuvenLerner 2 years ago ⁺¹
I'm afraid that I don't know much about the random-sample mechanism in Python. I assume that it's documented, and that you can choose which kind of random sampling you want to do... but that's about as far as my knowledge goes, I'm afraid!
@oueslatinihel6071 2 years ago ⁺¹
Thank you for this explanation, it was very helpful. But I need to ask: if I have a CSV file and I want to use exactly 1/4 of the dataset to train my model, and I don't want it to be random, what should I do? Thank you!
@ReuvenLerner 2 years ago
Try reading the file in chunks (i.e., set chunksize in read_csv), and then stop after you've read one chunk. That seems like the easiest thing.
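A minimal sketch of the chunked-reading idea above, assuming pandas is installed. The `read_csv` parameter is spelled `chunksize`. The in-memory CSV and its column names are made up for illustration; a real file path would take the place of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical data standing in for a real CSV file on disk
csv_text = "x,y\n" + "\n".join(f"{i},{i * 2}" for i in range(100))

# Count the data rows (excluding the header) to work out what a quarter is
total_rows = len(csv_text.splitlines()) - 1
quarter = total_rows // 4

# chunksize makes read_csv return an iterator of data frames;
# next() takes the first chunk, and the remaining rows are never parsed
reader = pd.read_csv(io.StringIO(csv_text), chunksize=quarter)
first_quarter = next(reader)

print(first_quarter.shape)
```

If only the first rows are ever needed, passing `nrows=quarter` to `read_csv` is another option that gives the same rows without an iterator.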
@atifdai313 7 months ago ⁺¹
I am using yearly data. Suppose my data has 33 rows and 20 columns, and the 20 columns include the years (1999 to 2022) in my summary statistics. How can I exclude the year column from my whole analysis? Or should I delete the year column? Please advise on any command for reshaping the data.
@ReuvenLerner 7 months ago
You can remove one or more columns with df.drop. If you want to remove all rows in a particular range, then you will likely want to use a boolean index to indicate what you do or don't want, and then apply it to the data frame. There isn't room here to explain that, but look for my video about "boolean indexing made simple" that explains it more.
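As a rough illustration of both suggestions, here is a small sketch. The data frame and its column names (`year`, `sales`, `costs`) are invented for the example and are not the commenter's data.

```python
import pandas as pd

# Hypothetical yearly data; the column names are invented for illustration
df = pd.DataFrame({
    "year": [1999, 2000, 2001, 2002],
    "sales": [10.0, 12.5, 9.0, 14.0],
    "costs": [7.0, 8.0, 6.5, 9.5],
})

# df.drop returns a new data frame without the year column;
# the original df is left unchanged
summary = df.drop(columns=["year"]).describe()

# Boolean index: keep only the rows whose year falls in a given range
mask = (df["year"] >= 2000) & (df["year"] <= 2001)
subset = df[mask]

print(summary.columns.tolist())
print(subset)
```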
@nadjagomes4854 2 years ago ⁺¹
Thank you for sharing your knowledge! Is there a way to randomly choose just one variable from a specific column?
@ReuvenLerner 2 years ago
My pleasure!
When you ask for a random sample, you're getting a random row (or several random rows) from the data frame. When you say that you want "just one variable," what are you referring to -- a specific column?
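If the question means one random value from a single column, one possible reading can be sketched like this. The data frame and the column name `score` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data; "score" is an invented column name
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [1, 2, 3, 4]})

# Selecting one column gives a Series; sample(n=1) picks one random element.
# random_state is optional and only makes the result repeatable.
one_value = df["score"].sample(n=1, random_state=42)

# .iloc[0] extracts the scalar from the one-element Series
print(one_value.iloc[0])
```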
@spaceadvanture6458 1 year ago ⁺¹
Does the sample represent the actual population? I mean, if I train a model using a sample data set, will it also be correct for the actual population?
Is it good practice to train a model on samples?
@ReuvenLerner 1 year ago
When it comes to machine learning, you're always training on a sample of the data. However, you normally don't want a totally random sample, because you want to make sure that all of the different possibilities are taken into account. If I train my model on a random sample of people, it's possible that I'll only get men above the age of 70. Which means that the model will be broken for anyone outside of that demographic. For that reason, stratified sampling is usually better for models -- and there's a whole field of expertise (which isn't me!) that talks about how to build your sample so that it's truly representative and can be used to extrapolate to the general population.
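One simple form of stratified sampling in pandas is to sample within each group, sketched below. This assumes pandas 1.1 or later (for `groupby(...).sample`), and the `group` column and its 80/20 split are invented for illustration. It is only a starting point, not a full treatment of representative sampling.

```python
import pandas as pd

# Hypothetical population: 80 rows in group "a", 20 rows in group "b"
df = pd.DataFrame({
    "group": ["a"] * 80 + ["b"] * 20,
    "value": range(100),
})

# Sample 10% within each group, so each group keeps its share of the sample
stratified = df.groupby("group").sample(frac=0.1, random_state=0)

print(stratified["group"].value_counts().to_dict())
```

A plain `df.sample(frac=0.1)` would return 10 rows as well, but with no guarantee about how many come from each group.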
@avibis6509 2 years ago ⁺¹
Thank you for the information, sir. But how do I exclude values less than or equal to zero (a different kind of sample)?
@ReuvenLerner 2 years ago
First, filter the rows, so that you end up with a data frame containing only those you want. But then you have a new data frame -- on which you can still ask for a random sample!
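The two steps described above might look like this in practice. The data frame and the column name `value` are made up for the example.

```python
import pandas as pd

# Hypothetical data containing negative, zero, and positive values
df = pd.DataFrame({"value": [-3, 0, 2, 5, -1, 8, 4, 0, 7, 1]})

# Step 1: keep only the rows with values strictly greater than zero
positive = df[df["value"] > 0]

# Step 2: the result is an ordinary data frame, so sample() works on it
sample = positive.sample(n=3, random_state=1)

print(sample)
```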
@biglicha 2 years ago ⁺²
Thanks! It was very useful for my homework.
@AJAYVAIDSstudent 10 months ago ⁺¹
“In this world, no one teaches random sampling as clearly as you.”
@ReuvenLerner 10 months ago
Thanks so much for your kind words!
@kartik1396 2 years ago ⁺¹
How does your file location autocomplete after you type ~ before courses?
@ReuvenLerner 2 years ago
If you press the "tab" key in Jupyter, it tries to complete identifiers (i.e., variables, functions, and classes), attributes (after a dot) and filenames (in certain contexts). It doesn't always work perfectly, but it does tend to work pretty well.
@eijo19 2 months ago ⁺¹
This is so helpful, thanks!
@ReuvenLerner 2 months ago ⁺¹
Glad you found it useful!
@l8870 2 years ago ⁺¹
Is there any way to prove that Python's random sampling is indeed random, from a statistical perspective?
@ReuvenLerner 2 years ago
I'm sure that there is - but that's way beyond my expertise. I tend to trust the Python core developers, and how they implemented the "random" module. I'm sure the documentation describes what kind of random sampling they're doing.
@l8870 2 years ago
@ReuvenLerner Hi, so in statistics you can check that your random sampling method does indeed have good randomness by using the runs test (Wald-Wolfowitz). My professor just taught me this at the end of a homework discussion ...
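For readers curious what such a check looks like, here is a rough sketch of the Wald-Wolfowitz runs test using only the standard library. The function name `runs_test_z` is invented for this example. It splits the values at their median, counts the runs of consecutive above-median or below-median values, and compares that count with what a random ordering would be expected to produce. A z value far from zero (beyond about ±2) suggests the ordering is not random. This tests only one aspect of randomness and does not prove anything on its own.

```python
import math
import random

def runs_test_z(values):
    """Return the z statistic of a runs test around the median.

    Values equal to the median are dropped before counting runs.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n % 2:
        median = ordered[n // 2]
    else:
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

    # True where a value lies above the median, False where it lies below
    signs = [v > median for v in values if v != median]
    n1 = sum(signs)              # count above the median
    n2 = len(signs) - n1         # count below the median

    # A new run starts wherever two neighbouring signs differ
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))

    total = n1 + n2
    mean = 2 * n1 * n2 / total + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - total)) / (total ** 2 * (total - 1))
    return (runs - mean) / math.sqrt(variance)

random.seed(0)
z_random = runs_test_z([random.random() for _ in range(1000)])
z_alternating = runs_test_z([0, 1] * 50)      # far too many runs
z_sorted = runs_test_z(list(range(100)))      # far too few runs

print(round(z_random, 2), round(z_alternating, 2), round(z_sorted, 2))
```

The alternating and sorted sequences are deliberately non-random, so they should give a large positive and a large negative z respectively, while the sequence from `random.random()` would be expected to land near zero.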
@umarabdullah1697 2 years ago ⁺²
Nice explanation
@monome3038 1 year ago ⁺¹
Thank you!
@ReuvenLerner 1 year ago
Glad you enjoyed it!
@ahmadjaradat3011 2 months ago ⁺¹
Very helpful, thank you so much. Your teaching skills are fantastic and smooth.
@ReuvenLerner 2 months ago
I'm delighted to hear it helped!