- 517
- 388 766
Statistics Ninja
United States
Joined Jan 21, 2012
Aaron Smith, Ph.D.
Mathematician/Data Scientist/Machine learner/Statistics ninja
Award-winning data scientist with experience working on a wide range of analytic projects. Uses data skills to help organizations reduce costs, mitigate legal risk, minimize employee workload, and improve outcomes.
Performing a first-stage moderated mediation analysis (Model 7)
26 views
Videos
05 Factorial designs principles and applications
67 views • 1 month ago
03 Experimental Data Setup - Blocking and Stratification
19 views • 1 month ago
01 Introduction to Experimental Design
83 views • 1 month ago
21 Ensemble Methods in Supervised Learning - Filmed during Hurricane Milton
59 views • 1 month ago
This video was filmed while Hurricane Milton was about to make landfall.
20 Model Selection in Supervised Learning
87 views • 2 months ago
18 Parameter Tuning in Supervised Learning
55 views • 3 months ago
17 Neural Networks in Supervised Learning
106 views • 3 months ago
16 K-Nearest Neighbors Models in Supervised Learning
169 views • 3 months ago
15 Supervised Learning with Gradient Boosting
93 views • 3 months ago
14 Random Forest Models in Supervised Learning
110 views • 4 months ago
Using a large language model for sentiment analysis
164 views • 4 months ago
Using a large language model to classify topics
135 views • 4 months ago
Using a large language model for classification supervised learning
349 views • 4 months ago
16 Histogram-based Gradient Boosting Regression Tree
263 views • 4 months ago
13 Supervised learning with decision trees
56 views • 5 months ago
12 Supervised learning with support vector machines
70 views • 5 months ago
11 Supervised learning with logistic regression
109 views • 6 months ago
10 Generalized linear models in supervised learning
56 views • 6 months ago
09 Feature Selection in Supervised Learning
209 views • 6 months ago
08 Lasso, Ridge, and Elastic-Net Regression in Supervised Learning
112 views • 6 months ago
07 Linear Regression in Supervised Learning
106 views • 6 months ago
06 Model Complexity and Generalization in Supervised Learning
98 views • 6 months ago
05 Evaluating Classification Supervised Learning Model Quality
53 views • 7 months ago
04 Evaluating Regression Supervised Learning Model Quality
97 views • 7 months ago
03 Preparing data for regression supervised learning
122 views • 7 months ago
nice sir! Thanks
Thanks for the video! What if both your sample and the population are imbalanced, but to different degrees (e.g., 5:1 vs. 10:1)? Would changing class weights to reflect the population imbalance rather than the sample imbalance be a solution? If so, how does this affect the calibration of the model?
@@kalechips965 Excellent question! It depends on your project. I would alter the weights to address the imbalance. If you need the model score to equal the true probability, I would rescale the model scores so that the mean training score equals the fraction of positive training cases. I would not worry about this if you do not need to interpret model scores; just find the cutoff that gives the best sensitivity, specificity, precision, recall, etc.
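The rescaling suggested in the reply above can be sketched in Python. This is a hypothetical illustration with scikit-learn, not the author's code: the dataset, the roughly 5:1 imbalance, and the logistic model are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data with roughly a 5:1 class imbalance
X, y = make_classification(n_samples=1200, weights=[5 / 6], random_state=0)

# Alter class weights to address the imbalance
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]

# Rescale so the mean training score equals the fraction of positives,
# for when the score needs to be interpretable as a probability
rescaled = scores * (y.mean() / scores.mean())
rescaled = np.clip(rescaled, 0.0, 1.0)  # keep scores inside [0, 1]
print(round(scores.mean(), 3), round(rescaled.mean(), 3), round(y.mean(), 3))
```

Because balanced class weights inflate the raw scores, the multiplicative factor here shrinks them back so their mean matches the training prevalence.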
@statisticsninja I appreciate the tips! One more question that's a bit more complex. As mentioned above, my sample has a 5:1 class imbalance. I've done a stratified split according to this imbalance, creating (1) a training set for model development (within which cross-validation subsets are themselves split using stratified k-folds) and (2) a test/holdout set purely for final model evaluation. For reproducibility, I have set a fixed seed variable and passed it to any method that has a "random_state" argument. There are two issues. First, my cross-validated training and test metrics are similar, but these same metrics can be more than 10% lower for my final test set. Second, multiple runs of my script using different random seeds cause the overall results to vary appreciably. I think these issues are related. I believe the relatively matched validation training/test scores reflect a good capability of my models to learn from each validation subset. However, the fact that the validation test scores tend to be higher than the (unseen) holdout test scores indicates to me that these models do not generalise well. I think this points to a data issue: I would guess that the overall sample is unbalanced or biased in ways that are not captured by the class-stratified sampling. For example, some important features may be underrepresented in certain subgroups of a class, and their distribution would vary significantly across splits, affecting classifier performance. Does this interpretation make sense to you?
You are amazing. I am using R leaflet to generate linear transects on the map with individual pairs of locations. Although I didn't find the solution in this video, I still tried the techniques in your video. It's amazing and helps me a lot. Thank you for posting this guide. Very detailed and understandable. Best wishes!
@@wanqingtai1490 excellent!!!
Do you have a reference to any literature that you've used this in?
@@minghuachang8126 You can use the citation() function in R to get the citation information for a package. Typically it references the journal article that announced the package.
What if you get a steep negative slope line in your added variable plot?
@@nasheedjafri3564 If a variable has a steep slope, positive or negative, then you want to include it in your model.
THANKS for the series, it's helping me 🙏
Can you please make a tutorial on spatial machine learning in Python?
@@fathymohamed4312 What type of spatial data do you have? The simplest approach would be to treat your spatial data as predictor variables. R has a lot more spatial tools than Python because R is more common in science.
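The "treat your spatial data as predictor variables" approach mentioned in the reply above can be sketched in Python. The coordinates, covariate, and spatial trend below are made up for illustration; scikit-learn's random forest is an assumed stand-in, not the author's method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 300
# Hypothetical point data: coordinates plus one non-spatial covariate
lon = rng.uniform(-100.0, -90.0, n)
lat = rng.uniform(30.0, 40.0, n)
covariate = rng.normal(size=n)
# Outcome with a spatial trend plus noise (synthetic, for illustration)
y = 0.5 * lat - 0.2 * lon + covariate + rng.normal(scale=0.5, size=n)

# The simplest approach: treat the coordinates as ordinary predictor columns
X = np.column_stack([lon, lat, covariate])
model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(round(scores.mean(), 3))
```

Because the coordinates enter the model like any other feature, no spatial library is required, though this ignores spatial autocorrelation between nearby points.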
@@statisticsninja thank you Sir
This is gold, thank you! I just want to ask how you conclude the dimensionality of this set of items, since PCA, EFA, and IRT CFA tell you different numbers of factors?
@@nadimalfana That is a good question. It is a subjective decision. Try to balance the goals and constraints of your project, and what you see in the data. Make the best call you can. There is usually a range of reasonable values.
@@statisticsninja Woah, that's tough decision haha... Thanks!
this is just what I needed. thankssss!!!!!
Cheer~~~a charge or claim that someone has done something undesirable---an accusation.😅
I have a model in my task: one numerical and two categorical variables. When I create a formula like you do here, formula1 = 'numerical ~ C(cat1) + C(cat2)', I observe that category one is slightly less than 5%, so I can reject the null hypothesis. However, I see in another video they use one categorical variable with one numerical variable, so formula2 = 'numerical ~ cat1', and there I observe that category one is 9%. What exactly is the difference when we use these two formulas, and which formula should we use?
Hi, I liked your video tutorial, it is quick and easy to follow. May I ask how do you add your own data say number of published studies in state? How do you incorporate that into your map? Cheers!
The easiest way is to get an sf object with the spatial features and a data.frame, then merge your data.frame with merge.sf(). If you have the spatial features without a data.frame, then you need to match your data.frame to the geometries.
Thanks
Thanks
Thanks
Thanks
When you use only categorical data where the options for the question are the same, you don't need to normalize the data?
For that situation the data are all on the same measurement scale, so I would not normalize. If I had data on different scales, such as height in inches and weight in pounds, then I would normalize.
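For the different-scales case (height in inches, weight in pounds), a minimal normalization sketch in Python, assuming scikit-learn's StandardScaler and a made-up four-row dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical mixed-scale data: height in inches, weight in pounds
X = np.array([[64.0, 120.0],
              [70.0, 180.0],
              [68.0, 150.0],
              [72.0, 200.0]])

# Standardize each column to mean 0 and unit variance,
# so the two scales contribute comparably to distance-based methods
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```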
Thanks
you need to create playlist for multiple correspondence analysis
Thanks
Is it possible to generate the communalities and/or proportion of total variance explained in BEFA? Looking for some metric to assess the model fit of my model. Thank you!
I could not find a function that does this for you. You could manually compute regression diagnostics using the equations in the BayesFM::befa model specification. Use your befa output to fill in everything except the errors. Be sure to switch the columns and signs of your befa output first.
How would I do this? The only output of BEFA are the factor loadings, variances of error terms, factor correlations and the indicator values. Is it possible to manually generate with only these values?
@@ThankfulAlways I found a better way. Use parameters::model_parameters and parameters::efa_to_cfa to get the parameters from your befa model. Then fit a confirmatory factor analysis model using your preferred package. This will give you the full power of your favorite factor analysis package.
@@statisticsninja oh really? I will do some research and try it out. Never tried doing confirmatory factor analysis before. Not sure how it's different from exploratory factor analysis. Will read on that. Thanks a lot!
hi how can I take item information and test information parameters? can you help me for these syntaxes?
After you fit your model, enter your model into the str() function. It will print the slots of your model. You can use the slot names to extract what you need.
@@statisticsninja Thank you for reply, I hope you would share a video on how to obtain item and test information functions in multidimensional confirmatory IRT. 😀
@@MrArdahazal Which function are you using to fit your model?
@@statisticsninja Hi, I constructed a model as follows, as far as I understand your presentation. Besides, I want to obtain the test information function and item information parameters, but I haven't understood them. I tried to code them below the model syntax, but I am not sure.

Model:
library(mirt)
library(latticeExtra)
cfa <- mirt::mirt.model(input = '
  pl = 1-9
  sb = 10-15
  db = 16-21
  dk = 22-28
  oy = 29-34
  COV = pl*sb, pl*db, pl*dk, pl*oy, sb*db, sb*dk, sb*oy, db*dk, db*oy, dk*oy')
ACS <- mirt(data = thesis, model = cfa, method = "MHRM", itemtype = 'graded', SE = FALSE, SE.type = "MHRM", TOL = 1e-2)
ACST <- coef(ACS, IRTpars = TRUE, simplify = TRUE)
options(max.print = 1000000)
print(ACST, digits = 2)

For the whole scale (test information matrix):
Theta <- matrix(seq(-4, 4, by = .01))
thetas <- fscores(ACS, method = "EAP", rotate = "oblimin", QMC = TRUE)
tinfo <- testinfo(ACS, thetas, degrees = c(0, 0, 0, 0, 0))
plot(thetas, tinfo, type = "l")

For item 1 (item information matrix):
Theta <- as.matrix(expand.grid(-4:4, -4:4, -4:4, -4:4, -4:4))
iteminfo1 <- extract.item(ACS, 1)
iteminfo <- iteminfo(iteminfo1, thetas, degrees = c(0, 0, 0, 0, 0), total.info = TRUE, multidim_matrix = TRUE)
options(max.print = 1000000)
@@MrArdahazal I hope this helps. I posted the RMarkdown file on my website's shared files section. ua-cam.com/video/k_oNhQ9Fy6w/v-deo.html
Thank
Helpful insights - very well done. Thank you!
You are so to the point. I really believe American professors are built different!
Hey, I am unsure if you will respond, but my boss wants me to do multiple imputation and I have never done that before. I have a large dataset, cleaned and manipulated. I am unsure what a predictor matrix is. Any help? How do I know which imputed dataset is better? I asked my boss about using random forest, because it handles both numeric and categorical data. Are there any insights, or any book or article that would be helpful?
The predictor matrix is all of your predictor variables in a matrix or data frame. For multiple imputation, I prefer missRanger. A way to compare imputation methods would be to copy your data, randomly replace values with missing values, try several imputation methods, then compare the imputed values to the original values.
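The compare-imputations idea above (copy the data, randomly blank out values, impute, compare to the originals) can be sketched in Python. The reply recommends R's missRanger; the scikit-learn imputers and the synthetic two-column data here are stand-ins for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=0.5, size=n)  # column correlated with x1
X = np.column_stack([x1, x2])

# Copy the data and randomly blank out ~10% of the second column,
# keeping the true values aside for comparison
mask = rng.random(n) < 0.10
X_missing = X.copy()
X_missing[mask, 1] = np.nan

# Try several imputation methods and compare imputed values to the originals
results = {}
for imputer in (SimpleImputer(strategy="mean"), IterativeImputer(random_state=0)):
    X_imputed = imputer.fit_transform(X_missing)
    rmse = np.sqrt(np.mean((X_imputed[mask, 1] - X[mask, 1]) ** 2))
    results[type(imputer).__name__] = rmse
    print(type(imputer).__name__, round(rmse, 3))
```

With correlated columns like these, a model-based imputer should recover the blanked values far better than a column mean, which is exactly the kind of difference this check is meant to surface.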
thank you
Much appreciated
What a great intro to sf and its functionality. Thank you so much and keep up the good work!
Thank you!
Can I use R for satellite remote sensing, and can you advise me where to start?
I have used R for analyzing satellite data. The project was a success. I would start by analyzing your data independently of the spatial coordinates, then analyze the coordinate distribution independently of the data, then perform a full spatial data analysis.
@@statisticsninja Thank you so much for your kind response, and I hope you can guide me on what course I should take to master R for analyzing remote sensing data in a short time. I started a long time ago, but unfortunately I found I have to learn many things to be able to use R to analyze remote sensing data.
Hello, maptools::Rgshhs is not available anymore. Is there any other code we can use?
Hi Aaron. I tried to use the list_cv section of the code on my data and, strangely, it created a list of length 0. Could you suggest what I can try? Also, the dataset is very small.
Make sure your data is in a data.frame and not a tibble.
I need help fixing an R application error: "Error in .local(obj, ...): Cannot derive coordinates from non-numeric matrix". When we use the raster::intersect(a, b) method, we get this error on a new server, but it works fine on the old R Shiny server.
Make sure your data frame has only numeric or integer columns and pass it to as.matrix() or data.matrix(). Also make sure you are not using a tibble.
@@statisticsninja okay let me check, thanks
@@statisticsninja We are using a spatial polygon and a point to intersect, which is not a matrix.

coordinates(o_yb) <- ~easting+northing  # convert the locations into a SpatialPoints object
proj4string(o_yb) <- CRS("+init=epsg:27700")

# for each order, get the ycodes which intersect the building boundaries
o_yb <- do.call(rbind, lapply(o$order_id[o$in_building == 1], function(x) {
  # x <- 1  # testing
  t_bld <- bld[bld$fid %in% o_bld$fid[o_bld$order_id == x], ]  # get the building boundaries for the order
  do.call(rbind, lapply(o_yb@data$key[o_yb@data$order_id == x], function(x1) {
    # x1 <- "YDLHP"  # testing
    t_o_yb <- o_yb[(o_yb@data$order_id == x & o_yb@data$key == x1), ]
    t1_bld <- r_intersect(t_bld, t_o_yb)  # check if the ycode is in the building
    if (length(t1_bld) == 0) return(NULL)
    t_o_yb@data[, c("order_id", "key")]  # if length is greater than 0, make a data.frame, else return NULL
  }))
}))
How do you check the quality of your imputation? I'm confused.
You can check whether there is a statistical difference between imputed and non-imputed data; you can run anomaly detection and see whether imputed records are disproportionately flagged as anomalies; you can also train another model on non-imputed data to predict the imputed column and check the residuals when you predict the imputed values.
@@statisticsninja Thank you for your help. Do you know which test I could use? I am a passionate amateur, so I have some gaps that I'm trying to fill.
@@abdulbouraa4529 I would compare the pre-imputation column to the post-imputation column by comparing the histograms and running a Kolmogorov-Smirnov test, ks.test().
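The reply above refers to R's ks.test(); an analogous check can be sketched in Python with SciPy's two-sample test. The "observed" and "imputed" arrays below are synthetic stand-ins for the non-missing values and the imputer-filled values.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Synthetic stand-ins: observed (non-missing) values and imputer-filled values
observed = rng.normal(loc=0.0, scale=1.0, size=400)
imputed = rng.normal(loc=0.0, scale=1.0, size=100)

# Two-sample Kolmogorov-Smirnov test; a small p-value would suggest the
# imputed values follow a different distribution than the observed ones
stat, pvalue = ks_2samp(observed, imputed)
print(round(stat, 3), round(pvalue, 3))
```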
@@statisticsninja Thank you very much !
Where should I put this code?
Python
I like your explanation. I'm trying to use this on a dataset with numerical and ordinal questions.
How do I get test information from a multidimensional model?
For a mirt model from the mirt package, the itemfit() and M2() functions extract test statistics. You could also use the @Fit slot from the mirt object directly.
@@statisticsninja But is there a way to evaluate the Test Information Curves? Or some analogous function?
@@andresimi Which package and function are you using to fit your model?
@@statisticsninja I am using the mirt package with the plot(type = "info") function. Right now, I split a multidimensional model into unidimensional ones so the curves are interpretable. I was wondering if this is OK, or if there is another way of doing this.
Hi. Thanks for this interesting video. I performed Fisher's exact test on a 4x2 table in SPSS and got a significant difference (p = 0.010). I wonder what post hoc test to use following that. Is it the adjusted standardized residuals? And if I want to calculate the p-value for each adjusted standardized residual, how can I do that?
You can look at 2x2 subtables and run hypothesis tests to identify which conditional distributions differ from the Fisher exact test null hypothesis. Consider using a p-value correction such as Bonferroni. If your variables have an independent-dependent relationship, you can run a chi-squared test on pairs of conditional distributions. The standardized residuals will show which cells are most different from the Fisher null hypothesis.
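The 2x2-subtable post hoc with a Bonferroni correction described above can be sketched in Python. The 4x2 table below is hypothetical, and scipy.stats.fisher_exact is an assumed stand-in for the SPSS analysis.

```python
from itertools import combinations

from scipy.stats import fisher_exact

# Hypothetical 4x2 contingency table: four groups by two outcome categories
table = [[20, 10],
         [15, 15],
         [5, 25],
         [18, 12]]

# Post hoc: Fisher's exact test on every 2x2 subtable of row pairs,
# with a Bonferroni-corrected alpha
pairs = list(combinations(range(len(table)), 2))
alpha = 0.05 / len(pairs)  # 6 comparisons
for i, j in pairs:
    _, p = fisher_exact([table[i], table[j]])
    verdict = "significant" if p < alpha else "not significant"
    print(f"rows {i} vs {j}: p = {p:.4f} ({verdict})")
```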
Thank you again
Thank you
can you show to us the data in csv file?
I posted the .dat files in the shared files section of my website. You can load them the same way you would load a .txt file
Thanks!
Very helpful video! Could you please briefly explain how to get the factor score in BEFA? Thanks!
If you set save.lvs = TRUE when you fit your model, blavaan::blavPredict() with type = "lvmeans" will give the factor scores of your fitted data. I could not get blavaan::blavPredict() or predict() to work with newdata. For new data, extract the coefficient estimates and multiply the observed values by their coefficients.
Very helpful! Thank you very much for posting this!
Hello! I cannot seem to find the dove dataset on your website - is there any way I could find it elsewhere? Thanks kindly!
The data set is on the book's website, asdar-book.org. The homepage has links to each chapter's data.
@@statisticsninja Wow, thank you so much! I really appreciate the quick reply and your fabulous videos!
Hey Aaron, love your videos, thanks so much for the content! Do you have a Github where we can find your code?
This man is trying so hard to make Survival analysis not sound morbid lol
Thanks
Thanks
Could you give us your Jupyter notebook?
I posted my R markdown file to the shared files section of my website.