Very useful for my Master's data science dissertation, as I'm working with a tremendously large dataset. Thanks a lot!!!
I'm delighted to hear it!
Hello there, I have a few questions about random sample generation: I need some sampling logic (10%) that covers every unique user in the given data set, and I then want to assign the generated samples to 'n' users. I know what I'm asking is quite basic, but I couldn't find anything relevant online. Can you kindly help? (This is for generating an audit sample from a CSV file.)
I'm afraid that I don't know much about the random-sample mechanism in Python. I assume that it's documented, and that you can choose which kind of random sampling you want to do... but that's about as far as my knowledge goes, I'm afraid!
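For anyone with the same audit question, here is a minimal pandas sketch under my reading of it: take roughly 10% of each user's rows, but never fewer than one row, so that every user is covered, and then hand the sampled rows out to reviewers in round-robin order. The column names (`user`, `txn_id`), the reviewer names, and the "at least one row per user" rule are all assumptions for illustration, not something from the video.

```python
import math
import pandas as pd

# Hypothetical data: three users with very different numbers of rows.
df = pd.DataFrame({
    "user": ["a"] * 20 + ["b"] * 5 + ["c"] * 1,
    "txn_id": range(26),
})

def sample_group(group, frac=0.10):
    # Take about `frac` of the rows, but always at least one row,
    # so that small users are still represented (an assumed rule).
    n = max(1, math.ceil(len(group) * frac))
    return group.sample(n=n, random_state=0)

# Sample within each user, then glue the pieces back together.
pieces = [sample_group(group) for _, group in df.groupby("user")]
audit = pd.concat(pieces).reset_index(drop=True)

# Assign the sampled rows to reviewers in round-robin order.
reviewers = ["reviewer_1", "reviewer_2", "reviewer_3"]
audit["reviewer"] = [reviewers[i % len(reviewers)] for i in range(len(audit))]

print(audit)
```

With a real file you would replace the hypothetical frame with `pd.read_csv(...)` and drop `random_state` if you don't need a repeatable draw.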
Thank you for this explanation, it was very helpful. I have a CSV file and I want to use exactly 1/4 of the dataset to train my model, and I don't want it to be random. What should I do? Thank you!
Try reading the file in chunks (i.e., pass chunksize to read_csv, set to a quarter of the number of rows), and then stop after you've read the first chunk. That seems like the easiest thing.
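Here is a rough sketch of the chunk idea. Note that the `read_csv` parameter is spelled `chunksize`. The in-memory CSV text is a hypothetical stand-in for a real file, and counting lines to find the number of rows assumes that no field contains an embedded newline.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk: a header plus 8 data rows.
csv_text = "x,y\n" + "\n".join(f"{i},{i * 10}" for i in range(8))

# First pass: count the data rows (subtract 1 for the header line).
total_rows = sum(1 for _ in io.StringIO(csv_text)) - 1

# A quarter of the rows, rounded down.
quarter = total_rows // 4

# Second pass: read in chunks of that size and stop after the first one.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=quarter)
first_quarter = next(reader)

print(first_quarter)
```

Passing `nrows=quarter` instead of `chunksize=quarter` should give the same first rows without the iterator. Either way you get the first quarter in file order, so if the file is sorted (say, by date), the training data will reflect that ordering.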
I am using yearly data. Suppose my data has 33 rows and 20 columns, and the 20 columns include the years (1999 to 2022) in my summary statistics. How can I exclude the year column from my whole analysis? Or should I delete the year column? Please also advise on any relevant data-shape command.
You can remove one or more columns with df.drop. If you want to remove all rows in a particular range, then you will likely want to use a boolean index to indicate what you do or don't want, and then apply it to the data frame. There isn't room here to explain that, but look for my video about "boolean indexing made simple" that explains it more.
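A small sketch of both ideas, using a made-up data frame (the column names and numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical yearly data.
df = pd.DataFrame({
    "year": [1999, 2000, 2001, 2002],
    "gdp": [1.0, 1.2, 1.1, 1.4],
    "inflation": [2.1, 3.4, 2.8, 1.6],
})

# Drop the column only for the summary; df itself is left unchanged,
# because drop returns a new data frame.
summary = df.drop(columns=["year"]).describe()

# Boolean index: True for rows to keep, False for the rest.
in_range = (df["year"] >= 2000) & (df["year"] <= 2001)
subset = df[in_range]

print(df.shape)   # (rows, columns) of the original
print(summary)
print(subset)
```

Because `drop` returns a new data frame, you don't have to delete the year column permanently just to keep it out of the summary.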
Thank you for sharing your knowledge! Is there a way to randomly choose just one variable from a specific column?
My pleasure!
When you ask for a random sample, you're getting a random row (or several random rows) from the data frame. When you say that you want "just one variable," what are you referring to -- a specific column?
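If the question means "one random value from one column," then a sketch like the following might be what you're after. The data frame and its column names are made up, and `random_state` is only there to make the draw repeatable.

```python
import pandas as pd

# Hypothetical data frame.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "score": [88, 92, 75, 60],
})

# A random row: the result is a one-row data frame.
random_row = df.sample(n=1, random_state=0)

# A random value from one column: select the column (a series),
# sample one element, and pull the value out with .iloc[0].
random_value = df["score"].sample(n=1, random_state=0).iloc[0]

print(random_row)
print(random_value)
```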
Does the sample represent the actual population? I mean, if I train a model using a sample data set, will it also be correct for the actual population?
Is it good practice to train a model on samples?
When it comes to machine learning, you're always training on a sample of the data. However, you normally don't want a totally random sample, because you want to make sure that all of the different possibilities are taken into account. If I train my model on a random sample of people, it's possible that I'll only get men above the age of 70. Which means that the model will be broken for anyone outside of that demographic. For that reason, stratified sampling is usually better for models -- and there's a whole field of expertise (which isn't me!) that talks about how to build your sample so that it's truly representative and can be used to extrapolate to the general population.
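As a rough illustration of the difference, here is one way to take a stratified sample in pandas: group by the column you want represented and sample the same fraction from each group. The data is made up, and `groupby(...).sample` needs a reasonably recent pandas (1.1 or later, as far as I know).

```python
import pandas as pd

# Hypothetical population: 8 men and 4 women.
df = pd.DataFrame({
    "sex": ["M"] * 8 + ["F"] * 4,
    "age": [71, 34, 45, 80, 29, 53, 66, 38, 41, 77, 25, 59],
})

# Plain random sample: nothing guarantees that both groups show up.
simple = df.sample(frac=0.5, random_state=0)

# Stratified sample: take half of each group, so the proportions
# in the sample match the proportions in the full data.
stratified = df.groupby("sex").sample(frac=0.5, random_state=0)

counts = stratified["sex"].value_counts()
print(counts)
```

This only balances the one column you group by. Deciding which columns matter for a truly representative sample is the part that belongs to the experts.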
Thank you for the information, sir. But how do I exclude values less than or equal to zero (a different kind of sample)?
First, filter the rows, so that you end up with a data frame containing only those you want. But then you have a new data frame -- on which you can still ask for a random sample!
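In code, that two-step approach might look like this (the column name `value` and the numbers are invented):

```python
import pandas as pd

# Hypothetical data with some zero and negative values.
df = pd.DataFrame({"value": [5, -3, 0, 12, 7, -1, 9]})

# Step 1: keep only the rows where the value is greater than zero.
positive = df[df["value"] > 0]

# Step 2: the result is an ordinary data frame, so sample from it.
picked = positive.sample(n=2, random_state=0)

print(picked)
```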
Thanks! It was very useful for my homework.
“In this world, no one teaches random sampling as clearly as you.”
Thanks so much for your kind words!
How does your file location autocomplete after you type ~ before courses?
If you press the "tab" key in Jupyter, it tries to complete identifiers (i.e., variables, functions, and classes), attributes (after a dot) and filenames (in certain contexts). It doesn't always work perfectly, but it does tend to work pretty well.
This is so helpful, thanks!
Glad you found it useful!
Is there any way to prove that Python random sampling is indeed random, from a statistical perspective?
I'm sure that there is - but that's way beyond my expertise. I tend to trust the Python core developers, and how they implemented the "random" module. I'm sure the documentation describes what kind of random sampling they're doing.
@ReuvenLerner hi, so in statistics you can check whether your random sampling method has good randomness by using the runs test (Wald-Wolfowitz). My professor just taught me this at the end of a homework discussion ...
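For the curious, here is a minimal sketch of a runs test using only the standard library. It marks each value as above or below the median, counts the runs of consecutive identical marks, and compares that count with what you'd expect from a random ordering, using the usual normal approximation. The function name `runs_test_z` is my own, and a result near zero doesn't prove randomness; it only means that this particular test found no evidence against it.

```python
import math
import random
import statistics

def runs_test_z(values):
    """Return the z statistic of a runs test around the median."""
    median = statistics.median(values)
    # True if above the median, False if below; ties are dropped.
    signs = [v > median for v in values if v != median]
    n1 = sum(signs)            # number of values above the median
    n2 = len(signs) - n1       # number of values below the median
    # A new run starts wherever two neighbors differ.
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / math.sqrt(variance)

# A sorted sequence has far too few runs: strongly negative z.
print(runs_test_z(list(range(20))))

# A strictly alternating sequence has far too many: strongly positive z.
print(runs_test_z([0, 1] * 10))

# Values from the random module should usually land near zero.
random.seed(0)
print(runs_test_z([random.random() for _ in range(200)]))
```

As a rule of thumb, a z value beyond about 2 in either direction is considered suspicious at the 5% level.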
Nice explanation
thank youuuu!
Glad you enjoyed it!
Very helpful, thank you so much. Your teaching skills are fantastic and smooth.
I'm delighted to hear it helped!