Introduction to Species Distribution Modeling Using R
- Published Nov 8, 2024
- This video is part of a course on Ecological Dynamics and Forecasting: course.naturec...
Data used in this video: course.naturec...
Lesson material related to this video: course.naturec...
This was a very nice tutorial. Never thought I'd listen to stuff like this smiling ;)
Thanks! Very glad it made it fun to learn.
You helped me finish my master's thesis. Thank you very much. I had problems with model performance (a very low AUC of just 0.54) due to a lack of absence points and overrepresented presence data. Dealing with this wasn't easy for me. I used MaxEnt as the model, but the steps are nearly the same.
This video and the Dan Warren and Richard Pearson talks are amazing for SDM/ENM.
Clearest introduction I've come across - thank you for your efforts : )
Glad it was helpful!
This was a very helpful and clear introduction, makes it so much easier to understand thank you!
Glad it was helpful!
It is so simple and helpful. Thank you very much for this tutorial.
You're welcome!
Thank you so much. The best video I came across regarding this topic. Thank you once again. It was so clear.
So glad that it's helpful!
This is by far the best SDM video I have come across. Thank you so much for your efforts. I was just wondering, how did you download the env data? I want to use the annual average temperature, tmax, and the annual average precipitation in my model.
Thanks! The code we used to generate both the species distribution data and the environmental data is here course.naturecast.org/lessons/r-species-distribution-models/supplemental_code/ Let us know if you have any questions
@@weecology Thank you so much for this. I went through the code and realised it still uses getData, which does not work anymore; we now use geodata. I ran the code and it works, but I am not sure how to select the specific bioclimatic variable I want to work on. I tried replacing tmin with BIO1 but I get the following error: "Error in geodata::worldclim_country("South Africa", "CMIP5", var = "BIO1", :
var %in% c("tavg", "tmin", "tmax", "prec", "bio", "bioc", "elev", .... is not TRUE" and I am not sure how to proceed. Thanks
Very nice video. Very informative and useful.
Thank you for the awesome video. This was very helpful.
Glad it was helpful!
Thanks for this great video for a simple SDM exercise.
Glad it was useful!
Great video, I have been suggesting this to some of my students to get into SDM using GLMs :)
Thanks! That’s great to hear!
This is amazing. Thank you so much! Is there a similar video available for when we don't have the absence data and use background data?
Thanks! Unfortunately I don't have one for background data and am unlikely to have time to make one in the near future. I have provided some links in response to a few other comments on sources for looking into background data a little more, so hopefully some of those will be helpful.
Hi, that's the nicest tutorial I've come across. I want to predict a conflict risk map based on presence/absence of conflict data. Can I use a GLM for this? I also need to add more than 2 variables. Kindly help me out.
Sir, please make a video on how to compare correlations among bioclimatic variables.
You are my hero!
Thank you very much for the nice explanation! It was very helpful!
You are welcome!
@@weecology Is there any chance that you could make a tutorial video on making an sdm with maxent or a method like ensemble?
@@goffehoving5383 Sadly I won't have time to do this anytime soon
Thank you so much. I wonder if there is a way to plot the ROC curve with ggplot.
You're welcome! It is possible to do this in ggplot, but you have to extract the information from dismo instead of doing it automatically. It should look something like this:
# Extract the true positive rate (TPR) and false positive rate (FPR)
tpr
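The snippet in the reply above is cut off in this copy of the thread. As a rough sketch of the idea only (not necessarily the code that was originally posted), the example below assumes that `evaluate()` from dismo is given vectors of predicted values for presences and absences, and that the resulting object exposes `@TPR` and `@FPR` slots, as used elsewhere in this thread. The simulated predictions and the names `presence_pred`, `absence_pred`, and `roc_data` are made up for illustration.

```r
library(dismo)    # provides evaluate()
library(ggplot2)

# Simulated predicted probabilities, standing in for real model output (assumption)
set.seed(1)
presence_pred <- rbeta(100, 4, 2)  # predictions at presence points
absence_pred  <- rbeta(100, 2, 4)  # predictions at absence points

# evaluate() can take predicted values directly instead of points plus a model
evaluation <- evaluate(p = presence_pred, a = absence_pred)

# Extract the true positive rate (TPR) and false positive rate (FPR)
roc_data <- data.frame(fpr = evaluation@FPR, tpr = evaluation@TPR)

# geom_path() connects the points in threshold order
roc_plot <- ggplot(roc_data, aes(x = fpr, y = tpr)) +
  geom_path() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "False positive rate", y = "True positive rate")
print(roc_plot)
```

With a real model you would instead call `evaluate()` with your presence data, absence data, and fitted model, as in the video.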
Could you make another video on species distribution modeling? I am interested in this subject.
Thanks for this helpful tutorial. I was wondering, why you did not check for the diagnostics of your model, how do you interpret the normality of residuals etc. for binomial data? or is this not relevant?
Those things are all certainly important for any model. You just can't cover everything in depth in a 45 minute demo (which is the length of the class period I made this for)
@@weecology thank you, yes that makes sense! I am just having a hard time interpreting the diagnostics of a binary model as I have found that those are often difficult to interpret for binary binomial GAM SDMs
Thank you very much for this video - this is very helpful. Sorry to ask this question, must all environmental rasters have the same resolution in the predict() or could they be different?
Glad it was helpful! Yes - I believe they have to be the same resolution. predict() takes a single RasterStack or RasterBrick object to make predictions on. RasterStack is definitely the same resolution for all layers and I believe the same is true for RasterBrick, though the documentation is less clear.
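If your layers do not share a grid, one possible approach is to resample them onto a common grid before stacking. This is a minimal sketch, assuming the raster package's `resample()` function and two toy rasters (`coarse` and `fine`, both hypothetical) that cover the same area at different resolutions.

```r
library(raster)

# Two toy rasters over the same area but at different resolutions (assumption)
coarse <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
fine   <- raster(nrows = 20, ncols = 20, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(coarse) <- runif(ncell(coarse))
values(fine)   <- runif(ncell(fine))

# Stacking these two directly is expected to fail because the grids differ

# Resample the fine raster onto the coarse grid
fine_on_coarse <- resample(fine, coarse, method = "bilinear")

# Now both layers share one grid and can be stacked for predict()
env_data <- stack(coarse, fine_on_coarse)
names(env_data) <- c("coarse_var", "fine_var")
env_data
```

Which grid to resample onto is a judgment call. Going to the coarser grid avoids implying more detail than the coarsest layer really has.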
@@weecology Once again, thank you very much for the warm reply and answers to my question regarding the handling of rasters and making their resolutions same across all.
@@anwmus Our pleasure!
This is an excellent tutorial for beginners in SDM. Can I ask how you got absence data with those zeros? I'm planning to use BRT in my research.
And is there a tutorial on where you get those .grd files?
@@JibHyourinmaru The data we used have real zeros, so we don't have to deal with generating absences. I think in response to one or more other comments I've linked to some resources about doing that, but I'm definitely not an expert.
The code for generating the .grd files is here: course.naturecast.org/lessons/r-species-distribution-models/data_preparation_code.R
It uses the getData function from the raster package:
tmin_now
@@weecology thank you so much. keep up the good work =)
Thank you for uploading an excellent video. I came across this video while studying MaxEnt. May I ask a couple of questions? Considering that background points are used in MaxEnt, why should these background points be located around the presence points? To my understanding, a background point is based on the hypothesis that the species does not occur there, isn't it? If I'm right, I would reason that they should be located far away from the presence points, but that isn't the case and I would like to know why.
Our group typically works with data that has absences, so I'm not an expert here. This link includes a decent introduction to selecting background data for MaxEnt:
onlinelibrary.wiley.com/doi/pdf/10.1111/j.1600-0587.2013.07872.x
Great tutorial. Thank you!
Thanks!
Thanks for these great videos! I am wondering about scaling issues. What if you are running a regression and need to scale your covariates, how would you apply this to your rasters?
In most cases you can do math (like the kinds of subtraction and division used for scaling covariates) directly on rasters. Check out our video on Mathematical Operations With Raster Data to see more details: ua-cam.com/video/l1sYh6eRozg/v-deo.html
Hello, I am once again watching your video; the knowledge shared here never gets old! Please, I was wondering if this technique can be applied to other domains of research outside of species ecology? For example, to occurrence data for point crime or disasters (fires or floods)?
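As a minimal sketch of that raster math, the example below centers and scales a toy raster using the mean and standard deviation of the training covariate, so the model and the map are on the same scale. The raster `tmin`, the vector `train_tmin`, and all values are hypothetical stand-ins.

```r
library(raster)

set.seed(42)
# Toy raster standing in for an environmental layer such as minimum temperature
tmin <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(tmin) <- rnorm(ncell(tmin), mean = 5, sd = 3)

# Hypothetical covariate values at the sampling locations used to fit the model
train_tmin <- rnorm(50, mean = 5, sd = 3)
tmin_mean <- mean(train_tmin)
tmin_sd   <- sd(train_tmin)

# Scale the training data and the raster with the SAME mean and sd
train_tmin_scaled <- (train_tmin - tmin_mean) / tmin_sd
tmin_scaled <- (tmin - tmin_mean) / tmin_sd
tmin_scaled
```

The key design choice is reusing the training mean and sd rather than recomputing them from the raster. Otherwise the fitted coefficients would not match the scaled prediction layers.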
Yes, the general concept works for anything where you are trying to determine if some outcome is likely to occur at some location where spatial correlates of that outcome occurring are considered reasonable predictors for occurrence in unsampled or future environments.
@@weecology Once again, thank you very much for the reply and answer. I am happy to read this it has wider application. I have fallen in love with this technique!
Thank you for this tutorial!
I am stuck with a problem with my data.
I have presence only data and presence points are less than 50. Therefore, if I want to run GLM with my data how many background points should I generate? What is the rule of thumb?
Our group typically works with data that has absences, so I'm not an expert here. There are lots of papers on selecting background points and I'd recommend taking a look at them. Here's a link to one that includes a decent introduction to selecting background data for MaxEnt models:
onlinelibrary.wiley.com/doi/pdf/10.1111/j.1600-0587.2013.07872.x
Great video,
just a question: how do you decide the suitability thresholds to binarize a map into 0-1 (absence/presence) when you have more than one predictive model (ensemble modeling), since different models will have different suitability cut-offs at which they optimally distinguish between suitable and unsuitable locations?
Thanks! Unfortunately the details of threshold selection in an ensemble modeling context are beyond my expertise. Based on experience in other areas that may or may not apply to this problem you could consider if you can determine the threshold after the ensembling step. So, ensemble to "probabilities" and then determine the threshold on those ensembled "probabilities".
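To make the "ensemble first, threshold second" suggestion concrete, here is a rough base R sketch. It is only one way to do it: the data are simulated, the two GLMs are simple stand-ins for genuinely different algorithms, and choosing the threshold by maximizing sensitivity plus specificity is an assumption for illustration, not a recommendation from the thread.

```r
set.seed(123)
n <- 300
temp   <- rnorm(n)
precip <- rnorm(n)
present <- rbinom(n, 1, plogis(0.5 + 1.5 * temp - 1 * precip))
dat <- data.frame(present, temp, precip)

# Two stand-in models (a real ensemble would combine different algorithms)
m1 <- glm(present ~ temp + precip, family = binomial, data = dat)
m2 <- glm(present ~ temp + I(temp^2) + precip, family = binomial, data = dat)

# Step 1: ensemble on the probability scale by averaging
p_ens <- (predict(m1, type = "response") + predict(m2, type = "response")) / 2

# Step 2: pick ONE threshold on the ensembled probabilities
# (assumed criterion: maximize sensitivity + specificity)
thresholds <- seq(0.05, 0.95, by = 0.05)
score <- sapply(thresholds, function(t) {
  pred <- p_ens >= t
  sens <- mean(pred[dat$present == 1])   # true positive rate
  spec <- mean(!pred[dat$present == 0])  # true negative rate
  sens + spec
})
best_threshold <- thresholds[which.max(score)]
binary_ensemble <- as.integer(p_ens >= best_threshold)
table(observed = dat$present, predicted = binary_ensemble)
```

In practice the threshold would be chosen on held-out data rather than on the data used for fitting, as done here for brevity.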
@@weecology thanks, it is a reasonable way
Do you have more workshop or training on this? I'm under project base and we use SDM for our future predictions and we are dealing with fish
No, this is the only one unfortunately
Thanks for this video;🙏❤
How can I get the script for this tutorial?
It would be better if the script and data were available.
There is a link to the written version of the tutorial and to the data in the description for the video. This should get you access to both as well: course.naturecast.org/lessons/r-species-distribution-models/
Dear sir, please make video on ANN based predictive modelling....
Hi, thanks for the video. I have more than one species. At each point, more than one individual was recorded. I want to use LULC as an environmental variable. What model would best suit this kind of data?
The easiest way to handle multiple species is to model them separately, but this can lose useful information about the relationships between the species. There is a growing set of models for "joint species distribution modeling" that allow you to fit multiple species in a single model. With multiple individuals you can either convert them to presence-absence, which is what I did here, or try to model abundance instead of presence-absence. To model abundance you need a model that yields integer values for y. You can do this with a Generalized Linear Model (GLM) using a Poisson or Negative Binomial error distribution (typically with a log link), as long as it's safe to assume that the responses to your environmental variables are linear on the link scale (the same assumption I made here for probability of presence). LULC probably ends up being multiple predictors, one for each type of use/cover in typical situations. Just be careful: if they are percentages (or proportions) that have to sum to 100 (or 1), then you have one less independent variable than you think (since once you know the percentages of n-1 cover classes, by definition you know the percentage of the last cover class).
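A minimal base R sketch of the abundance option described above, using simulated counts. The variable names (`temp`, `forest`) and coefficients are made up. `forest` stands in for the proportion of one land-cover class, with the remaining class left out because the proportions sum to 1.

```r
set.seed(1)
n <- 200
temp   <- runif(n, 0, 20)
forest <- runif(n, 0, 1)  # proportion of forest cover (hypothetical LULC class)

# Simulated counts: the log of expected abundance is linear in the predictors
abundance <- rpois(n, lambda = exp(-1 + 0.1 * temp + 1.5 * forest))
dat <- data.frame(abundance, temp, forest)

# Poisson GLM with a log link for count data
abund_model <- glm(abundance ~ temp + forest,
                   family = poisson(link = "log"), data = dat)
summary(abund_model)

# Expected abundance at a new (hypothetical) site
new_site <- data.frame(temp = 12, forest = 0.6)
expected_abundance <- predict(abund_model, newdata = new_site, type = "response")
expected_abundance
```

If the counts are more variable than a Poisson model allows, a negative binomial GLM (for example `glm.nb()` in the MASS package) is a common alternative.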
could you please make a video on how to use maxent algorithm for presence-background data
Our group typically works with data that has absences, so I'm not an expert here, but this link includes a decent introduction to selecting background data for MaxEnt:
onlinelibrary.wiley.com/doi/pdf/10.1111/j.1600-0587.2013.07872.x
@@weecology Thank you so much sir! ❤️🇮🇳
Hiya, I love the video. I read a previous comment about using maxent to generate absence data; to what extent do I need to replace code in this? I also plan to fit a multivariate GAM. I am guessing I can use the maximum entropy code to get absence data points only, or does the maxent code generate the whole model? I.e., what code is still relevant to use? I'm currently working on my dissertation on SDMs of seagrass, so it would really help if you could provide some insight!
The maxent package will fit the model. An optional argument will have that model fitting function automatically select background points for you. So in general you should be able to swap out the model fitting step but use a lot of the other code here. That said, I'm not familiar with the subtleties of evaluating maxent models that use background data, so I'd recommend taking a look at this paper (onlinelibrary.wiley.com/doi/pdf/10.1111/j.1600-0587.2013.07872.x) to make sure it's OK to use standard evaluation metrics with background data as opposed to true absences.
Amazing... Thanks brother
Very glad it's helpful
Hi, this video is amazing and very helpful. However I am asking myself how did you get a raster that consists of two variables (precip and tmin)? I was not able to merge my own (depth and chlorophyll). Can you help?
Thanks! You should be able to do this using the stack function from the raster package. If you have two different raster files you run stack with both file paths to load them. E.g.,
data
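The example in the reply above is cut off in this copy of the thread. As a rough sketch of what loading two files into one stack can look like, the code below first writes two toy rasters to temporary files so that it runs on its own. The file names and the `depth` and `chlorophyll` layers are hypothetical, and both files must share the same extent and resolution.

```r
library(raster)

# Create two toy single-layer raster files so the example is self-contained
# (in practice these would be your own files, e.g. depth and chlorophyll)
template <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
depth       <- setValues(template, runif(ncell(template), 0, 100))
chlorophyll <- setValues(template, runif(ncell(template), 0, 5))
depth_file <- file.path(tempdir(), "depth.grd")
chl_file   <- file.path(tempdir(), "chlorophyll.grd")
writeRaster(depth, depth_file, overwrite = TRUE)
writeRaster(chlorophyll, chl_file, overwrite = TRUE)

# Load both files into a single multi-layer RasterStack
env_data <- stack(c(depth_file, chl_file))
names(env_data) <- c("depth", "chlorophyll")
env_data
```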
Thanks a lot, it is a great job. I need some help please, it is very important for my PhD research work. I want to download Historical monthly weather data from Worldclim for 2010-2018 time period. And use this data for current prediction. Do you know R scripts for this data? Or how I can change CMIP5 data to CMIP6 weather data? Thanks in advance.
Thanks Aida! I'm not aware of anything for showing you how to do either unfortunately as they are a little more complicated. I've started playing around with a good demo for time-resolved Worldclim data and will try to post something when I get the chance.
You are a life saver.
Very happy to hear it
Hi, this is great. I was wondering if you might be able to help me. I've run a point sample analysis to gather the data from 17 variables at each individual point I have in a dataset. How can I display this in a scatter plot? Just the 17 variables as they are, for visualization to show how similar they are; I was told that I don't need to plot them against raw data or against anything else.
Probably the easiest thing, if I'm understanding the question, is to use the `pairs()` function, which will produce the bivariate scatter plots for every pair of variables. See r-coder.com/correlation-plot-r/#Plot_pairwise_correlation_pairs_and_cpairs_functions for some examples.
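A minimal sketch of `pairs()` on a toy data frame. The four columns are simulated stand-ins for the sampled variables, and the same call should work with 17 columns, though the panels get small.

```r
set.seed(7)
# Toy data frame standing in for values sampled at each point (hypothetical)
env_values <- data.frame(
  tmin   = rnorm(100, 5, 3),
  precip = rnorm(100, 800, 150),
  elev   = rnorm(100, 500, 200)
)
env_values$tmax <- env_values$tmin + rnorm(100, 10, 1)  # correlated with tmin

# One scatter plot for every pair of columns
pairs(env_values)

# Matching table of pairwise correlations
cor_matrix <- cor(env_values)
round(cor_matrix, 2)
```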
Hey, thanks for the amazing tutorial! I tried to apply this to my own data but when I want to extract the raster values for each location, it returns NA for all locations. Do you know what might be the issue here?
Glad it was helpful! It's a little hard to know why you're getting NAs without seeing the details, but the most likely issue is that the point locations and the raster aren't in the same projection, so the points appear to the computer to not be in the same place as the raster. We have a set of videos that might help with this in our Geospatial Data in R playlist:
ua-cam.com/play/PLD8eCxFKntVHGCOTs6Cvo_EAp1Xh-yVoQ.html
@@weecology thank you, that was the problem indeed! Another question: can I also visualise more than two environmental variables in the scatterplot as described at minute 14:00 ? If so, what would the code be as there is only one x and one y axis?
@@ameliefriedsam4090 Visualizing in >2D environmental space is a tricky problem. You have two general options: 1) Look at all pairs of variables plotted against one another. This doesn't give you a full picture of the multi-dimensional space but it does let you visualize everything. You can use the corrplot package or the `ggpairs` function in the GGally package to do this for all possible pairs of columns in a data frame. 2) You could plot three variables by adding a third dimension to the plot using one of the packages listed here: www.r-graph-gallery.com/3d.html. Beyond that you get into multivariate statistics and needing to compress the information into fewer axes, but that makes it harder to interpret the visualization.
Is there a way to view the thresholds at individual points on the roc curve?
On the predictions maps after setting the threshold, all predicted areas seem to be predicted with a likelihood of 1 (full green). Is that right? Or maybe just my screen not showing the shades well?
> Is there a way to view the thresholds at individual points on the roc curve?
The threshold data is stored in the evaluate object, in our case in evaluation@t. It's in the link form, so if you want it in the interpretable 0-1 form that we discuss you need to add the optional `type = "response"` argument to the evaluate function. If you want to label the plot you probably also want to reduce the number of points substantially, which you can do using the `tr` argument. Here's some rough draft code:
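library(dismo)    # provides evaluate()
library(ggplot2)
library(ggrepel)  # provides geom_label_repel()
# Evaluate on the 0-1 response scale at a reduced set of thresholds
evaluation = evaluate(presence_data, absence_data, logistic_regr_model,
                      type = "response", tr = seq(0, 1, 0.1))
eval_data = data.frame(tr = evaluation@t, tpr = evaluation@TPR, fpr = evaluation@FPR)
# Plot the ROC points and label each one with its threshold
ggplot(eval_data, aes(x = fpr, y = tpr)) +
  geom_point() +
  geom_label_repel(aes(label = tr))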
> On the predictions maps after setting the threshold, all predicted areas seem to be predicted with a likelihood of 1 (full green). Is that right? Or maybe just my screen not showing the shades well?
The way this is shown in the video, we aren't showing probabilities anymore; we're just saying "plot a filled cell if the probability is above some value". You'd typically want to remove the color ramp to present this. If you want to still display the probabilities, but only those above a certain value, then you need to mask or replace the other probabilities. To get the same basic plot with the probabilities, just set all of the predicted probabilities below 0.5 to 0:
predictions[predictions < 0.5] = 0
plot(predictions, ext = extent(-140, -50, 25, 60))
You can also set them to NA, which would generally be better, but then you'd need to add a vector outline to be able to see the rest of the continent.
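The two alternatives described above might look something like the following sketch. It uses a toy raster of random values in place of the real `predictions` object, and the 0.5 threshold is carried over from the example above.

```r
library(raster)

set.seed(3)
# Toy raster of predicted probabilities standing in for `predictions` (assumption)
predictions <- raster(nrows = 20, ncols = 20,
                      xmn = -140, xmx = -50, ymn = 25, ymx = 60)
values(predictions) <- runif(ncell(predictions))

# Option 1: a true presence/absence map (every cell is 1 or 0)
predicted_presence <- predictions >= 0.5

# Option 2: keep the probabilities but hide everything below the threshold
masked_predictions <- predictions
masked_predictions[masked_predictions < 0.5] <- NA

plot(predicted_presence, legend = FALSE)  # no color ramp for a binary map
plot(masked_predictions)                  # NA cells are left blank
```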
Great work it is 😊
Thanks!
Thanks for this!! So how would I go about converting my .tif files from arcgis into .grd/.gri?
All of the major raster packages in R will load .tif files, so no need to convert them.
@@weecology cool, so pretty much just follow your tutorial then, correct?
Yes, it should (hopefully) “just work”. Let us know if it doesn’t
@@weecology Will do! I have multiple variables in tif files (ie soil types, max temp, min temp, ave precip). Your approach was just to use 2 variables, but can i use your example for my attempt at mapping? Sorry for all the questions.
No problem. Yes, basically just the same thing but with a few more variables in the model
Thanks for the Korean subtitles
The translation is all Google's work, but we do work hard on the English subtitles to make it easier for the algorithms to translate. Glad it works!
I have 2500+ data points for lizards, including lat and long, with different color morphs. I want to see the distribution of the different color morphs in different regions and how it has been changing over the years. How can I do this?
ahaha! :D it was worth the watch till the last second! aa-la-la-la-laaa
Glad you're enjoying the outtakes!
I have some trouble loading the .grd files at the very beginning of the video from the folder on my desktop into R, as apparently the stack function doesn't allow the full path?! Appreciate the help.
stack will take full paths (I just tried it out to confirm). A couple of things to check: 1. Did you unzip the zip file; 2. Is the path typed correctly; 3. Did you load the dismo package (which loads the raster package). The last one is probably the most likely and most confusing possibility. There is a "stack" function in base R that does something different and that won't work if you run the provided command without loading the dismo (or raster) package first. Once you load one of those packages the base "stack" function gets masked by the one that loads raster stacks.
Can you do a tutorial that uses background data instead of absence data?
I am getting error in the extend function, how to rectify?
Hey, great stuff... I'm just having a bit of a situation in the function that extracts the environmental data and the species location. Rstudio sends me a message that says ... "Error in extract(env_data_current, hooded_warb_locations) :
object type 'S4' is not subsettable "... Do you know how to fix this? Greetings from Mexico =)
I checked the code on my machine and everything runs normally. Based on the error message you're seeing I suspect that something went wrong prior to this line of code and that the `env_data_current` object, which should be a raster, isn't a raster for some reason (or maybe a similar issue with `hooded_warb_locations`). If you look at `env_data_current` it should be a RasterStack and `hooded_warb_locations` should be a two column data frame with `lon` and `lat` columns. If one of those isn't what is expected then try checking the code creating that variable. If they both look OK I'd try reinstalling the `raster` package (which gets loaded by `dismo` and is where `extract` comes from).
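A short sketch of those checks, using toy stand-ins for `env_data_current` and `hooded_warb_locations`. It also calls `raster::extract()` with an explicit namespace. That last step covers one additional possibility not raised in this thread: other packages (tidyr, for example) also export a function named `extract`, and whichever package was loaded last is the one that gets called.

```r
library(raster)

# Toy stand-ins for the tutorial objects (values and extents are made up)
env_data_current <- stack(
  raster(matrix(1:100, 10, 10), xmn = -100, xmx = -90, ymn = 30, ymx = 40),
  raster(matrix(100:1, 10, 10), xmn = -100, xmx = -90, ymn = 30, ymx = 40)
)
names(env_data_current) <- c("tmin", "precip")
hooded_warb_locations <- data.frame(lon = c(-95.5, -92.5), lat = c(35.5, 38.5))

# Check that both objects are what extract() expects
class(env_data_current)     # should be a RasterStack (or RasterBrick)
str(hooded_warb_locations)  # should be a data frame with lon and lat columns

# Call the raster version explicitly in case another package masks extract()
env_values <- raster::extract(env_data_current, hooded_warb_locations)
env_values
```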
It happened to me as well, please re-install the packages, it did work after the installation.
Hello bro. I want to do an analysis of three species endemic to a particular region. How can I read this data in R? I would also like to show the species in different colours.
Hi! would like to know how to import the predicted results (predictions) to ArcGIS.... would like to create a map layout in ArcGIS from the result of the SDM.
The `predictions` object is a raster, so you can write it out into common formats using the raster package:
library(raster)
writeRaster(predictions, "sdm_predictions.tif")
The file extension that you provide to the filename (the second argument) in writeRaster will determine what raster format is used to save the file. I used a GeoTIFF in this case.
> sdm writeRaster(sdm, "E:/FRM 291/SDM/FRM 291 SDM/Aquilaria_SDM/sdm_predictions.tif", format = "GTiff")
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'writeRaster': object 'sdm' not found
What do you think is the problem?
@@tomasjrreyes4392 It looks like the error is happening when trying to load the "predictions.asc" file, which R doesn't think is a raster file. As a result `sdm` doesn't exist and `writeRaster` fails. I just generated the `predictions` object following the tutorial and everything works.
Did you try to store that object in a file?
@@weecology thanks for your help... I was able to resolve the issue by adding this command/function... "writeFormat()"
Thanks for your help, ;)
Glad you got it figured out!
Which r and r studio version are you using?
I'm not sure which versions they were since I recorded these a couple of years ago. It would have been whatever versions were current in Fall 2020.
I have data in tiff format. Can I process it in R?
I also want to ask which R packages are required for species distribution
You can read geotiff data using the standard packages for rasters (raster & stars) in R. In terms of packages it depends a bit on what kind of modeling you're doing. dismo is a standard package for evaluating distribution models, so I'd say dismo + whatever modeling packages you want to use, e.g., maxent or mgcv.
Hello. I'm getting stuck at the first step itself. I'm getting an error:
hooded_warb_data = read.csv("hooded_warb_locations.csv")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'hooded_warb_locations.csv': No such file or directory
Hi Sarvesh - this error means that the data files aren't in your working directory. If you've already downloaded and unzipped the files linked in the description then they must be in a different directory from the one RStudio is working in. This video may help with figuring that out: ua-cam.com/video/2sReMmTMYFk/v-deo.html
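A small base R sketch of how to check this. To stay self-contained it writes a tiny stand-in CSV to a temporary folder, and the coordinates in it are made up. With the real data, `data_dir` would be the folder where you unzipped the files.

```r
# Create a small stand-in CSV in a temporary folder (hypothetical data)
data_dir <- tempdir()
write.csv(data.frame(lon = c(-85.1, -90.3), lat = c(35.2, 33.8)),
          file.path(data_dir, "hooded_warb_locations.csv"), row.names = FALSE)

getwd()                                   # where R looks for relative paths
file.exists("hooded_warb_locations.csv")  # FALSE unless the file is in getwd()

# Option 1: give the full path to the file
hooded_warb_data <- read.csv(file.path(data_dir, "hooded_warb_locations.csv"))

# Option 2: change the working directory, then use the short file name
old_wd <- setwd(data_dir)
hooded_warb_data <- read.csv("hooded_warb_locations.csv")
setwd(old_wd)  # restore the previous working directory
```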
@@weecology thank you very much
I have a shapefile and a separate CSV. How do I join these two files to display the points on my map?
This playlist on working with Geospatial data should walk you through all of the relevant steps
I am trying to download hooded_warb_locations.csv but I am unable to. Is there any alternative solution, please?
The data for this demo is linked in the description. It's a zip file so you'll need to unzip it and then you'll have all of the relevant data.
Hi. This is an awesome video. But I have one problem. When I try to run the stack line, R gives me this:
Error in data.frame(values = unlist(unname(x)), ind, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 4, 0
this is my code:
options(java.parameters = "-Xmx1g")
library(raster)
library(dismo)
setwd("C:/Tesis_MDE")
occs
I'm glad it's useful!
What is the output of list.files("./M/presente1","*.asc$",full.names=T)?
What are the units?
If you're asking about the units of the output from the logistic regression, then it's a probability (and therefore unitless), though the probability is not necessarily well calibrated, hence the discussion of thresholds.
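To illustrate the difference between the default "link" output and the 0-1 probabilities, here is a minimal base R sketch with simulated data. The variable names and coefficients are made up.

```r
set.seed(10)
tmin <- rnorm(200)
present <- rbinom(200, 1, plogis(-0.5 + 2 * tmin))
model <- glm(present ~ tmin, family = binomial)

new_sites <- data.frame(tmin = c(-2, 0, 2))

# Default predictions are on the link (log-odds) scale: any real number
link_pred <- predict(model, newdata = new_sites)

# type = "response" gives unitless probabilities between 0 and 1
prob_pred <- predict(model, newdata = new_sites, type = "response")

# The two scales are related by the inverse logit
all.equal(unname(prob_pred), unname(plogis(link_pred)))
```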
I am not able to load env_current files...can anyone please help
What error message are you getting?
Visually it looks very bad in R , like when children play.
This has nothing to do with how things look in R in general. This is a lesson designed to focus on modeling not on data vis and so, following good pedagogy, I focus only on what I'm teaching not on adding a bunch of extra stuff related to data vis.