Kasper, Just working through your tutorial this week and it is excellent. It has obviously been some time since you made the video, and the code on the IMDb site has changed. For example, the CSS selectors now have different names, which just makes it a bit more challenging and interesting. Thanks for doing this.
Thanks, and double thanks for framing my outdated CSS selectors as a learning challenge :). Still, I think I should then update them at least in the document, so (third) thanks for the heads up!
Kasper, I found it very helpful. It was a great video and you set the bar high. Very informative and filled with concepts.
You should make more videos. You explained this on point! On R and everything about it, you will surely be the best, no doubt!
Congratulations! This is an excellent and lucid explanation of how to web scrape with R's rvest. I had no idea it was this simple.
Excellent one! Thank you for this! Well structured & explained and very useful!
This was really helpful. Thanks for this awesome tutorial.
Hey Kasper. Thanks for the free YouTube Premium in an Airbnb in Berlin last week 😅. I logged you out when I went home. 👍🏻
Hahahaha 🤣. Nice, thanks!!
thank you so much for all your hard work! I learned a lot from this video!!
Thanks! Happy to hear it's helpful
Thank you for this video, it was really helpful.
Really good tutorial, thanks a lot!! :)
Awesome video thank you
THANK YOU TEACHER, VERY IMPORTANT, IM NEWBIE
Thank you very much, very helpful
Again thanks for the fine presentation. How about XPath? Have you considered covering that? I was hoping it would help with a table I was scraping, but I could not figure out what to hang my hat on. The website is very unusual: you can view a table (the one I would like to scrape), but the code returns a list of three tables, not one. What is annoying is that the html code has no distinct tags or marks to work with.
Hi Haraldur. It's true that xpath is a bit more flexible, so that might help address your problem. But you can also get quite creative with CSS selectors. If there are no distinct tags/ids/classes or whatever for the specific element you want to target, the only way might be to target the closest parent, and then traverse down the children based on tags and their positions. For instance, something like: "#top-nav > div > nav > ul > li:nth-of-type(2)".
What can help with those types of annoyingly long paths is to use something like the Google Chrome "SelectorGadget" plugin (which I didn't know existed when I made the video). This lets you select an element on a page and it gives you either the CSS selector or the XPath.
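To make that concrete, here's a minimal sketch of how such a long path plugs into rvest (the URL and the selector below are made up for illustration):

library(rvest)

page <- read_html("https://example.com")   # hypothetical URL
page %>%
  html_element("#top-nav > div > nav > ul > li:nth-of-type(2)") %>%   # walk down from the closest parent with an id
  html_text2()   # extract the text of the selected element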
@@kasperwelbers
Kasper,
Thanks for the information. It clearly takes a lot of experimenting. I wound up settling on these two code options for extracting the third table:
html_doc |>
html_elements("table") |>
html_table(header = TRUE) |>
pluck(3)
pluck is from the purrr package (pull will not work here).
Or using xpath:
html_doc |>
html_elements(xpath = '//center[position() = 3]/table') |>
html_table(header = TRUE)
The pluck method is more elegant in my mind but xpath is clearly worth learning at one point or another.
By the way, I am using the native pipe, which will not always work, but the regular magrittr pipe will.
H
@@haraldurkarlsson1147 pluck indeed offers a nice solution here!
There is certainly some value in learning XPath, as it's more flexible and also works well for XML files. That said, when doing it in R I tend to prefer your first solution, because it's easier to debug. The XPath approach is probably slightly faster, but in web scraping the main speed bump is the HTTP requests, so in practice I think the difference in speed would hardly be noticeable.
Thank you!
Hi! I'm a newbie and when I run the 7th and 8th lines, I get: "Error in read_html(url) : could not find function "read_html"". Could you please tell me what's wrong? It's the same when I run "%>%": I get "Error in read_html(url) %>% html_element("table.wikitable") : could not find function "%>%"". Same with library(tidyverse).
I had already run "# install.packages("tidyverse")"
Never mind, it worked because I installed it directly via the "Install" button in the "Packages" pane!
I'm happy
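For anyone who hits the same errors: here's a minimal sketch of the setup that avoids them, assuming the Wikipedia table example from the tutorial (the url is just a placeholder):

# install.packages("tidyverse")   # installing the tidyverse should also install rvest
library(tidyverse)   # attaches the %>% pipe (among other things)
library(rvest)       # attaches read_html() and html_element()

url <- "some url"    # placeholder: the page you want to scrape
read_html(url) %>%
  html_element("table.wikitable") %>%
  html_table()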
Very useful! But my issue is not with the R code but rather with reading through the HTML code and finding the right places. I had a heck of a time. For one thing, the font was tiny and the code was enormously long. Is it searchable?
Dear Haraldur, the hardest part of scraping is indeed not so much the code, but learning how to find and select HTML elements. The good news, though, is that this process is the same regardless of what scraping software you use. So if at some point you're using Python, it still applies, and there are also tools for automating an actual web browser, such as Selenium (and RSelenium), for which the main task is also finding and selecting HTML elements. That being said, there are great tools for searching through the HTML code. The main one is the Inspect option, as also discussed in the tutorial. If you use the one in Chrome, you can search for both strings and CSS selectors, so that's a great way to find elements and figure out how to select them with rvest. Also, note that if you right-click an element on the webpage and select Inspect, it automatically shows the HTML code for this element.
@@kasperwelbers
Kasper, I have indeed used Inspect in Chrome. My main problem is that the font is so small that I have a hard time reading it.
@@haraldurkarlsson1147 Ahhh, like that! You should be able to change the font size like any content on a webpage. In Chrome, at least for me, it works by holding Ctrl and then scrolling up/down with my mouse. This changes the font size for whatever window you're pointing at.
Where is the html tutorial link that you mentioned? It’s not in the description of the video.
Hi Brittny, you're right. I replaced the HTML file with a .md file (the R Markdown file is already knitted), because somehow links to the html file on GitHub didn't work. Did you need an html version in particular?
html_element('table.wikitable')
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "character"
I am getting this error while searching for the html node.
Hi Erol. I suspect you are now calling html_element(...) by itself, and not within a pipe.
The first argument of html_element should be an html page, which you create with read_html. So that would look like:
html_page = read_html("some url")
html_element(html_page, ".wikitable")
But the pipe operator allows us to write it like this:
read_html("some url") %>%
html_element(".wikitable")
In this case the output of read_html is plugged into html_element as the first argument.
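Put together, a minimal end-to-end sketch would look something like this (keeping the "some url" placeholder and assuming the table has class "wikitable"):

library(rvest)   # read_html(), html_element(), html_table()

read_html("some url") %>%               # download and parse the page
  html_element("table.wikitable") %>%   # select the table by tag and class
  html_table()                          # convert it to a data frame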