Support Vector Machine (SVM) with R - Classification and Prediction Example
- Published 26 Nov 2024
- Includes an example with:
brief definition of what SVM is
svm classification model
svm classification plot
interpretation
tuning or hyperparameter optimization
best model selection
confusion matrix
misclassification rate
Machine Learning videos: goo.gl/WHHqWP
Becoming Data Scientist: goo.gl/JWyyQc
Introductory R Videos: goo.gl/NZ55SJ
Deep Learning with TensorFlow: goo.gl/5VtSuC
Image Analysis & Classification: goo.gl/Md3fMi
Text mining: goo.gl/7FJGmd
Data Visualization: goo.gl/Q7Q2A8
Playlist: goo.gl/iwbhnE
SVM is an important machine learning tool for analyzing big data and for working in the data science field.
R is a free software environment for statistical computing and graphics, and is widely used in both academia and industry. R works on both Windows and Mac-OS. It was ranked no. 1 in a KDnuggets poll on top languages for analytics, data mining, and data science. RStudio is a user-friendly environment for R that has become popular.
Dear Sir, Thank u very much for the video and code. I can say I learned ML and r coding using your tutorial much more than udemy, lynda, and other works. Good Job. Your channel is the best indeed!
You are most welcome!
# Learning from "Support Vector Machine (SVM) with R - Classification and Prediction Example"
# Preparation: load the data and look at its distribution
data("iris")
str(iris)
library(ggplot2)
qplot(Petal.Length, Petal.Width, data=iris, color=Species)
# Step 1: run the SVM and choose a suitable kernel
library(e1071)
mymodel=svm(Species~., data=iris, kernel = "polynomial")
#------- The kernel of mymodel can be changed to radial or linear, or kept as polynomial
summary(mymodel)
# Step 2: tuning, i.e. hyperparameter optimization, to select the best model
set.seed(123)
tmodel=tune(svm, Species~., data = iris, ranges = list(epsilon =
seq(0,1,0.1), cost = 2^(2:9)))
#------- seq generates a sequence from 0 to 1 in steps of 0.1, 11 values in total;
#------- cost takes the values 2^2 to 2^9, 8 values in total; 11 x 8 = 88 parameter combinations, which can take a long time on large data
plot(tmodel)
summary(tmodel)
# Step 3: select the best model and plot it
mymodel=tmodel$best.model
summary(mymodel)
plot(mymodel, data = iris, Petal.Width~Petal.Length,
slice = list(Sepal.Width = 3, Sepal.Length = 4))
## Petal.Width~Petal.Length defines which variable is X and which is Y
# Step 4: compute predictive performance
## Confusion Matrix and Misclassification Error
pred=predict(mymodel, iris)
tab = table(Predicted = pred, Actual = iris$Species)
tab # tab shows the prediction results
1-sum(diag(tab))/sum(tab) # compute the misclassification rate
Not sure about your question.
@@bkrai Thanks. It's the R code for this video.
Thumbs up !!
most of your tutorials are pretty useful.
you have a good knack of explaining complicated techniques in a simplified way.
Thanks for the feedback!
I am an avid subscriber of yours. Your videos are simply outstanding and very helpful for self study. Thank you very much for your videos and all the hard work.
Thanks for feedback and comments!
Your tutorials are priceless. Thank you for sharing your knowledge. This was easy to understand and to the point.
Thanks for comments!
Thank you again for these complete episodes. You have been of a great help to me "Rai". Please, I'd appreciate a complete episode on the ensembles, essentially, heterogeneous ensemble using DT, SVM etc. inclusive as the base classifiers.
Comprehensive videos on ensembles are not common, in fact, I haven't come across any. It will go a long way If you could put something together on this. Thank you for your help!
Thanks for the suggestion, I'll do it in near future!
Sir, will you please explain what cost, gamma and radial mean and what they do? Also please explain radial and sigmoid. I'm sorry to ask so many questions, but since you always help me understand the concepts clearly, this is my request. Thank you Sir.
you're the best teacher ever
Thanks for your comments!
I see that many videos say let us predict and use the predict command. What are you trying to predict? What output is expected?
Why when you used the slice function you set Sepal.Width = 3 and Sepal.Length = 4 ? Is this just for convenience since they are the last two variables that need to be accounted for? Are these the boundaries that are created when you created the graph?
Thank you so much. A great explanation of the SVM model.
You are welcome!
Sir, your teaching is excellent. Please post some videos on how to handle semi-supervised machine learning algorithms in R, especially in the case of SVM.
Thanks for the suggestion!
Excellent Session sir on SVM...Very Useful
Thanks!
Hello Dr. Rai, Thanks for your great tutorials. I should say I learnt ML and R coding using your tutorials much more than from udemy, lynda, and other works. Good Job. Your channel is the best indeed! I recommended it to all my friends!
I was wondering that would you teach us some machine learning in python?
Thanks for your comments! I plan to do Python in a few months.
Incredible explanation sir... please make a video list of parametric and non-parametric tests as early as possible
Thanks for the suggestion!
Very nice video. Easy to understand. Appreciated your effort.
Thanks for comments!
Thank you for your simple and easy-to-follow video tutorials. You are awesome!
Thanks for your feedback!
Thank you so much for your wonderful videos!
There is one question about this video, that is , when using the function "tune", it always says that "Error in if (tunecontrol$cross > n) stop(sQuote("cross"), " must not exceed sampling size!") :
argument is of length zero"
Have searched for solutions and tried to convert the data used to a list but still did not work.
Would you please suggest how to fix it?
Thank you!
I saw this today, probably by now you must have addressed this.
One Word --- Awesome , Thanks Sir..
Welcome!
Thank you Dr. Rai. This video was really helpful and entertaining.
You are welcome!
So you only used the petal length and width to do the SVM test and ignored the sepal characteristics? Or did they affect the algorithm?
The others can be tried in the same way.
1. While plotting the model at 4:06, why did you choose "Petal.Width~Petal.Length"? Is it because these variables have low correlation?
2. Also what is the reason to select Sepal.Width = 3 and Sepal.Length = 4? Is it because while using these values we see a better classifier while plotting the model?
I found this
From ?plot.svm
slice a list of named numeric values for the dimensions held constant (only needed if more than two variables are used). Dimensions not specified are fixed at 0.
In other words, when visualising the effect of predictor variables on the response you can specify which other predictor variables are to be held constant (i.e. at a fixed value).
So in your example, you're visualising the effect of the predictor variables Petal.Length and Petal.Width on the response while keeping Sepal.Width and Sepal.Length constant at the specified values
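To make the slice idea above concrete, here is a minimal sketch, assuming the e1071 package is installed. It holds the two sepal variables at their sample means rather than at the 3 and 4 used in the video; using the means is an assumption made here for illustration, not the choice made in the video.

```r
library(e1071)

data(iris)
model <- svm(Species ~ ., data = iris, kernel = "radial")

# Hold the two variables that are not on the axes at their sample means
# (an illustrative choice; other values inside the data range can be tried).
slice_values <- list(
  Sepal.Width  = mean(iris$Sepal.Width),
  Sepal.Length = mean(iris$Sepal.Length)
)

# 2D view of the decision regions over the two petal variables
plot(model, data = iris, Petal.Width ~ Petal.Length, slice = slice_values)
```

Changing the values in slice_values and re-plotting shows how the 2D picture depends on where the other two dimensions are fixed.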
Many thanks sir,thank you!I have a question for you. In the following statement: "mymodel
It's because a 2D plot can accommodate only 2 variables.
@@bkrai Thanks for the answer
@@bkrai Sir, you have assigned constant values to the other variables. How did you decide on those constant values, sir?
Thank you Mr. Rai for this excellent demonstration and explanation of SVM.
Regards.
thanks for feedback!
Sir, may I know why sepal length and sepal width are assigned constant values? Does that mean we can't plot the model with more than 2 variables? If I have to assign constant values, how do I decide them, like the 3 and 4 you assigned? Suppose I used the Boruta algorithm for variable selection before running the SVM model and got 5 of 10 variables as important. How do I then plot the SVM model? Please help me by replying to my comment.
Hi Sir, I read a few articles saying that SVC is for binary classification, and that for multiclass classification we have to use either the OneVsOne or the OneVsRest method. In this video I can see you haven't selected either of them. Does this library take care of this by itself? Can you please explain this? Regards
You can refer to the documentation provided for the library for more details about multiclass-classification approach used:
cran.r-project.org/web/packages/e1071/e1071.pdf
Thank you sir
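On the multiclass question above: the e1071 documentation describes a one-against-one approach for more than two classes, with k(k-1)/2 binary classifiers and a voting scheme. The sketch below, assuming e1071 is installed, simply counts the pairwise classifiers for iris by asking predict() for the decision values.

```r
library(e1071)

data(iris)
model <- svm(Species ~ ., data = iris)

# Ask predict() to also return the decision values of the binary classifiers
pred <- predict(model, iris, decision.values = TRUE)
dv <- attr(pred, "decision.values")

k <- nlevels(iris$Species)   # 3 classes
ncol(dv)                     # one column per pair of classes: k * (k - 1) / 2
colnames(dv)                 # which class pair each column refers to
```

With three species this gives three pairwise classifiers, so no explicit OneVsOne or OneVsRest setting is needed in the call.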
Hi, why aren't you doing the typical training and test split in this case?
That can be easily done here too.
hello sir, can you provide some sources for SVR code for regression in Matlab as I want to optimize the hyperparameters using meta-heuristic algorithms
Unfortunately I don't use matlab.
brilliant, brilliant, brilliant sir.....request= can you do one please for regression
Thanks, I've added it to my list.
thankyou sir, can you please share the link
Here is the link:
drive.google.com/open?id=0B5W8CO0Gb2GGc1ZZQWhmMmpuWWc
Okay , if i got the model ... how can i do to get an equation to for example use it in an application ? i mean to reproduce the classification results without R ? Thank you
Sir, as the kernel changes, the number of support vectors changes. Can this number be a measure of the accuracy of the model?
For accuracy you should use info in the confusion matrix.
Thankyou Sir, This tutorial was quite useful but I am trying to create a user-defined function for SVM analysis in which I can define the data set kernel, and other parameter for the data set in function calling. How can I do that ?
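One way to approach the user-defined function asked about above is sketched here, assuming e1071 is installed. The helper name run_svm and its return structure are hypothetical, chosen only for illustration; extra arguments are forwarded to svm() through the dots.

```r
library(e1071)

# Hypothetical helper: fit an SVM with a chosen kernel and report the
# in-sample confusion matrix and misclassification rate.
run_svm <- function(formula, data, kernel = "radial", cost = 1, ...) {
  model <- svm(formula, data = data, kernel = kernel, cost = cost, ...)
  actual <- model.response(model.frame(formula, data))
  tab <- table(Predicted = predict(model, data), Actual = actual)
  list(model = model, confusion = tab, error = 1 - sum(diag(tab)) / sum(tab))
}

# Same data, different kernels, set in the function call
res_linear <- run_svm(Species ~ ., iris, kernel = "linear")
res_poly   <- run_svm(Species ~ ., iris, kernel = "polynomial", degree = 3)

res_linear$confusion
res_linear$error
res_poly$error
```

The error reported here is computed on the same data used for fitting, as in the video, so it will usually look more optimistic than an error measured on held-out data.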
Hi! Excellent tutorial! all very clear.. I have a data set with four columns only, these are location, duration, date and time. I implemented the svm model for prediction, but all predicted values are incorrect. How can I approach date and time? I did normalize the data but still prediction rate is bad.
If one of the variables is date/time related, I would say use time series. Facebook recently open sourced its time series forecasting package. Here is the link:
ua-cam.com/users/edit?o=U&video_id=7xDAYa6Ouo8
Hi! thank you, but the link is pointing to an empty page of youtube.
Here is the correct link:
ua-cam.com/video/7xDAYa6Ouo8/v-deo.html
Very nice video to watch during my exam preparations! The music would be nicer if it was maybe 50% of the volume at any point where you are talking. Otherwise well explained and great to watch :)
Thanks for the tip!
@@bkrai epsilon doesn't seem to have any effect on the results when I use tune like you do. But I found that another example used "gamma" instead of "epsilon" for another model, and that had an effect on SVM for me (surprisingly). Do you know why it's like that?
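A likely explanation for the observation above: in e1071, epsilon is documented as the parameter of the insensitive-loss function, which is used for regression, so it would not be expected to change a classification fit. gamma, on the other hand, is part of the radial kernel itself. A small check, assuming e1071 is installed:

```r
library(e1071)

data(iris)

# Same classification model fitted with two very different epsilon values
m_eps_low  <- svm(Species ~ ., data = iris, epsilon = 0.1)
m_eps_high <- svm(Species ~ ., data = iris, epsilon = 0.9)

# For a classification SVM the two fits should agree
all(predict(m_eps_low, iris) == predict(m_eps_high, iris))
c(m_eps_low$tot.nSV, m_eps_high$tot.nSV)

# gamma, in contrast, changes the radial kernel, so the fits can differ
m_gamma_low  <- svm(Species ~ ., data = iris, gamma = 0.01)
m_gamma_high <- svm(Species ~ ., data = iris, gamma = 10)
c(m_gamma_low$tot.nSV, m_gamma_high$tot.nSV)
```

If that holds, tuning a classification model over gamma and cost is a more informative grid than epsilon and cost.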
Thanks for the wonderful session on SVM. I have a question regarding how did you choose value for epsilon , cost for the tuned model. If it is a trial and error method, I would like to know how did you end up getting that.
The best values are chosen by the model itself from the range that we provide.
yes I agree that sir. But how did you come up with this range. it looks like the optimal value is entirely depends on the range which we provide. is that right?.
Yes I agree sir. But how did you come up with that range. It looks like that the optimum value for cost & epsilon is entirely depends on range we provide. Is that right sir?.
For epsilon, the range has to be between 0 and 1, so you can try 0.1 increments. If the plot suggests further fine-tuning, you can even try 0.05 or 0.01 increments. For cost, the default value is 1. As mentioned in the video, you need to try a very wide range, which is why we used 2^2 etc. In most situations this approach will help you get the best values for these parameters. The idea is to use a very wide range for both so that you don't miss the best values.
oh fine sir.
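The wide-range idea discussed above can be sketched as follows, assuming e1071 is installed. The grid uses powers of two for both cost and gamma; these specific ranges are illustrative assumptions, not values from the video. The last lines check whether the winning combination sits on the edge of the grid, which would suggest widening the range and tuning again.

```r
library(e1071)

data(iris)
set.seed(123)  # cross-validation folds are drawn at random

# Coarse grid on a log scale (illustrative ranges)
gamma_grid <- 2^(-5:1)
cost_grid  <- 2^(-1:6)

tuned <- tune(svm, Species ~ ., data = iris,
              ranges = list(gamma = gamma_grid, cost = cost_grid))

tuned$best.parameters
tuned$best.performance   # cross-validated error of the best combination

# If the best value is at the boundary of the grid, the range may be too narrow
best <- tuned$best.parameters
on_edge <- best$cost %in% range(cost_grid) || best$gamma %in% range(gamma_grid)
on_edge
```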
Very well explained and very useful!
Thanks!
Hi Rai, thanks for this clear lecture. But I have a question: I followed exactly the same steps as you, but when I use the tune function I get a different result. I get the best parameter cost 4 (instead of your 8) and the best performance 0.04 (instead of your 0.033). All my steps are exactly the same as yours. Do you have any idea why this happened?
Very good content Sirji!
Sir, how do I use the best model on a test data set?
Instead of iris data with the model, you can use test data.
@@bkrai Thanks Sirji
welcome!
Thank you very much. Please can you show me how to downsample and oversample the positive data samples to avoid data imbalance?
Here is the link:
ua-cam.com/video/Ho2Klvzjegg/v-deo.html
Hello sir, on line 14 of the script (4.56 mins in the video) we have slice. How do we select the values in it, and if there are many variables in the data, should we run SVM separately for two variables at a time?
This is what slice represents - "a list of named values for the dimensions held constant (only needed if more than two variables are used). The defaults for unspecified dimensions are 0 (for numeric variables) and the first level (for factors). Factor levels can either be specified as factors or character vectors of length 1."
In the video we used values that are more reasonable than default zero.
Very clear and helpful. Thank you sir!
Welcome!
Very good explanation! Instantly subscribed to your channel.
Thanks for comments!
super sir, here there is clear separation but "cleveland heart" from UCI is complex and have lot of overlapping...
That's right. And for data that have lot of overlapping, it is always a good idea to try more methods.
Sir for large sample value what could be the value of epsilon and cost..
Is there any way to extract variable importance in SVM? If so, could you please suggest how to do that? Thanks
You can try feature extraction using the link below before doing svm:
ua-cam.com/video/VEBax2WMbEA/v-deo.html
Dr. Bharatendra Rai Thanks.
God bless you sir. You are the best and much appreciated!!!
Thanks for comments!
on tuning im getting this error..please help sir...Error in do.call(method, c(list(train.x, data = data, subset = train.ind[[sample]]), :
'what' must be a function or character string
>
Dr, do you have any numeric svm (regression) tutorial?
Not yet.
very informative and well explained.
Thanks for your comments!
Sir, can you explain the intuition for three classes, as you explained for two classes? One hyperplane is drawn between two classes. If a third class is there, how does it separate them?
Many thanks again for your amazing video.
Can you let me know how we evaluate the variables?
Such as we have 10 variables but only 5 of them are significant (for ex; in logistic regression, we evaluate them by P-value and OR (95%CI)).
Some said that we use weights to evaluate them: every variable has its weight, and the higher the weight, the more significant.
And can you give me the code for that?
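On the weights mentioned above: for a linear kernel and a two-class problem, the weight vector of the separating hyperplane can be recovered from a fitted e1071 model as the product of the coefficients and the support vectors. The sketch below assumes e1071 is installed and uses a two-species subset of iris purely as an illustration. These weights are not p-values, they refer to the scaled predictors, and they do not carry over to radial or polynomial kernels.

```r
library(e1071)

# Two-class subset so that there is a single separating hyperplane
two <- droplevels(subset(iris, Species != "setosa"))
model <- svm(Species ~ ., data = two, kernel = "linear")

# Weight vector and intercept of the linear decision function,
# expressed on the scaled predictors that svm() used internally
w <- t(model$coefs) %*% model$SV
b <- -model$rho

# Larger absolute weight suggests more influence on the decision
sort(abs(w[1, ]), decreasing = TRUE)
```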
Sir your videos are excellent and very easy to understand...!! Can you please post a video on regression models using SVM and ANN? That would be a great help in understanding the differences in results and validation parameters observed by using same algorithms. Thank you.
For ANN, you can use:
ua-cam.com/video/SrQw_fWo4lw/v-deo.html
@@bkrai Yes sir... I had already gone through that video but I wasn't able to do the same with my data. That's why I'm requesting this from you.
Sir, for discrete independent variables, can we use them as factors model?
Yes, should work fine.
Excellent video!! Thanks for sharing.
Thanks for comments!
is there anything i can do to get the size of every specie? i get the number of support vectors alright but it doesn't show the distribution... and also, i have 38 variables... how do i plot the graph for all of them?
I did this
tuned_model
Hi sir,
Can you please explain the significance of the parameter epsilon?
Regards
It affects the number of support vectors.
Thank you very much from the bottom of my heart.
You are very welcome!
Thanks. Why the iris data is not partitioned to train and test in this tutorial?
I did it to keep length of the video small. But data partitioning should be done for all machine learning methods.
@@bkrai Thanks Sir.
welcome!
Hello sir! This was very helpful, thank you so much. Can you please tell me how to split the data into train and test, because I didn't quite understand how you split the data here? Or is there a link to a previous tutorial? Thank you so much
@@asmam-k7150 You can see ua-cam.com/video/RLjSQdcg8AM/v-deo.html
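For the train and test question above, a minimal sketch of a random split on iris follows, assuming e1071 is installed. The 70/30 ratio and the seed are assumptions made here for illustration. The model is fitted on the training rows only and then evaluated on the held-out rows.

```r
library(e1071)

data(iris)
set.seed(123)

# Hold out roughly 30% of the rows for testing
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit on the training rows only
model <- svm(Species ~ ., data = train)

# Evaluate on rows the model has not seen
pred <- predict(model, test)
tab <- table(Predicted = pred, Actual = test$Species)
tab
1 - sum(diag(tab)) / sum(tab)   # test misclassification rate
```

Tuning with tune() would then also be run on train, keeping test untouched until the final evaluation.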
Thanks for your video!! How to calculate AIC and BIC in SVM?
very wonderful and useful
i have a problem in install package in R can you help me
the problem is [ unable to install packages (default library 'c:/program files/r/r-3.4.3/library' is not writeable)]
probably you can restart RStudio and retry installing the package.
thanks
Sir, what does slice = list(Sepal.Width = 3, Sepal.Length = 4) indicate?
This is what slice represents - "a list of named values for the dimensions held constant (only needed if more than two variables are used). The defaults for unspecified dimensions are 0 (for numeric variables) and the first level (for factors). Factor levels can either be specified as factors or character vectors of length 1."
In the video we used values that are more reasonable than default zero.
Sir,
I am unable to understand this line:
slice = list(Sepal.Width = 3, Sepal.Length = 4))
What is the use and why 3 and 4?
Does SVM capture the nonlinear interaction effects across variables when using RBF?
That's correct.
Does SVM separate those factor levels like clusters? If so, why are there so many vectors?
It's outcome of the algorithm and depends on type of data.
I cannot understand why do we use slice ?Could you please explain more about it.
Hello. Thanks for your videos. I was wondering that could you teach us about genetic programming in R if there is any? Thanks
Thanks for the suggestion, I've added this to my list.
Sir, how to identify the important variables in SVM when we have a set of variables?
Sir I have one question. Why didn't you divide the data into train and test.
Since it was already a part of many videos, I try to focus just on SVM. But you are right, it's always better to partition the dataset.
Can you show us in another video how to do support vector regression with a dataset with many variables? It would be great
thanks for the suggestion, I've added it to my list.
My data is qualitative it contains all variables are categorical...is svm applicable to my data??
Try random forest.
Every time I try to plot after running the SVM model
> plot(SVM Model name, data = data file name, Y axis variable~X axis variable)
I get this error:
> Error in Summary.factor(c(26L, 20L, 50L, 29L, 33L, 43L, 29L, 9L, 3L, 10L, :
‘min’ not meaningful for factors
How do I correct this error?
Instead of factor, use a numeric variable.
@@bkrai But Dependent variable is binary , so I have to say factor, isnt it? Even in your video, species is factor.
Hello sir! This was very helpful, thank you so much. Can you please tell me how to split the data into train and test, because I didn't quite understand how you split the data here? Or is there a link to a previous tutorial? Thank you so much
Here is a link that has more details:
ua-cam.com/play/PL34t5iLfZddspfUiv-9EaOVNUG64_fwFq.html
Thank you 😁
welcome!
Hi Bharatendra, I am trying to run SVM model on dataset with 15 features and the label is binary, it looks something like this
y_test$SurveyYes
I would suggest try and use the same format as shown in the video.
Thanks a lot Dr. Rai for uploading this tutorial. I would like to apply this SVM method to calculate a susceptibility index that can be plotted in ArcGIS, so I need to know the predicted values of the dependent variable:
1. How can be calculated?
2. Can I use for that the same coding as in the case of neural network?
Thank you very much
Hii, I am also facing a similar issue. I have developed the model using the training dataset and tested it. But I am not sure how to import the developed model in ArcGIS to apply it to the actual raster layers!!
Can you help me out?
You are really great Sir!!!!
Thanks for comments!
Hi sir, we need an SVM to handle a binary dataset in Java. Would you help us with this?
thank so much for this video sir....can i apply this to a Raster image (i.e., Array) and could you please share the R script as well sir
it depends on what type of data you have, no harm in trying. Here is the link to R code:
drive.google.com/open?id=0B5W8CO0Gb2GGc1ZZQWhmMmpuWWc
Ok sir, thanks sir..... do u also have videos on KNN, Naive bayes and R codes for ROC, PCA and Multiple linear regression
@@bkrai thank you so much guru ji
sorry typo in the previous question, for discrete independent variables, can we use them as factors in our model
Factor variables are usually of "nominal" type. For definitions you can use this link:
ua-cam.com/video/1hF0x7WsVOI/v-deo.html
Great video sir. Sir, can you make a video on the Taylor diagram?
Thanks for comments and suggestion!
Its very pretty, sir please share the link of R script
How is qplot done?
If we have more variables, how can I use qplot?
In a scatter plot, we can only have two numeric variables at a time. If you have more variables, select two most important and see if they are helping to classify response or not.
My dataset is multi variable how can i apply svm on it, can u help me??
What do you mean by multi variable? Does it mean more than one variable? If yes, then you should have no problem applying svm.
@@bkrai R is telling me "all arguments must have the same length" how can I solve this problem ?
hi how to work with high frequency data with SVM, thanks
From high frequency data you can extract features and then use svm.
Sir, I am getting the following error. could you say what can be done
> plot(mymodel, data = iris,
+ Petal.Width~Petal.Length,
+ slice = list(Sepal.Width = 3, Sepal.length = 4))
Error in `[.data.frame`(expand.grid(lis), , labels(terms(x))) :
undefined columns selected
I see a typo in Sepal.length = 4
use "L" in length.
sir, cost function = should it always start from 2 or we can have 3 to the power of ?
with 2 square, we start at cost value of 4 and then go to 8, 16, etc.. With 3 square, it will start at 9 and then jump to 27, 81, etc. But you can try it and see if it helps or not.
Thank you sir for this video
Most welcome!
good explanation
Thanks for comments!
very good job
Thanks for comments!
Why this was not divided into test/train?
Here just illustrated how to do SVM in R. But you are 100% correct, if you are applying it to any problem, make sure to split data in test/train.
@@bkrai thank you sir for your response. Also, if you could answer: I tried this on the Pima Indians diabetes dataset (very famous); except for sigmoid I couldn't see colored boundaries (+ve and -ve category) for any other function, and the misclassification error is lowest for linear, yet the algorithm (your method to find the best function) says that radial is the best one. Can you guess what could be happening under the hood?
Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :
Need numeric dependent variable for regression.
why do I always get this error whenever I'm using this formula?
mymodel
What is dependent variable in your data?
Thank you for your response. I also tried the iris data and follow the tutorial, but still got the same error.
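One possible cause of "Need numeric dependent variable for regression", offered as an assumption since the failing code is not shown: svm() chooses classification only when the response is a factor. If the response column is stored as character, for example after read.csv in R 4.0 or later where stringsAsFactors defaults to FALSE, svm() falls back to regression. The sketch below, assuming e1071 is installed, simulates that situation with iris and shows the conversion.

```r
library(e1071)

# Simulate a response that was read in as text rather than as a factor
df <- iris
df$Species <- as.character(df$Species)
class(df$Species)

# Converting the response to a factor lets svm() treat this as classification
df$Species <- factor(df$Species)
model <- svm(Species ~ ., data = df)

summary(model)
is.factor(predict(model, df))
```

Running str() on the data before fitting shows whether the response is a factor, character, or numeric column.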
Thanks again, sir! please upload the R file sir.
tab
Make sure pred and actual have same number of data points.
where can i find your r code ???
Here is the link:
drive.google.com/open?id=0B5W8CO0Gb2GGc1ZZQWhmMmpuWWc
Thank you so much Sir!
Most welcome!
why didnt u split data to test and train before
It is always good to split data. I didn't do it here to keep the video short.
If I split the data, which data would I fit the SVM models on, test or train?
And Thank you professor:D
We develop the model using train data.
You are welcome!
excellent really worth
Thanks!
ROC Curve & AUC value Demo should be here
You can find them here: ua-cam.com/video/ypO1DPEKYFo/v-deo.html
What is set.seed? How do we decide the set.seed value?
you can choose any number you like. And then you can use that same number when you try to repeat analysis with same results.
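A small base R illustration of the point above: resetting the same seed reproduces the same random draw, which is why tune() gives repeatable cross-validation results when set.seed() is called just before it. The seed values here are arbitrary.

```r
# Same seed, same random draw
set.seed(123)
first <- sample(1:150, 5)

set.seed(123)
second <- sample(1:150, 5)

identical(first, second)

# A different seed generally gives a different draw
set.seed(456)
third <- sample(1:150, 5)
third
```

Note that the same seed can still give different draws across R versions, since R 3.6.0 changed the default sampling algorithm; this may explain results that differ slightly from those shown in an older video.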
That music....kept me awake
😊