Webscraping in R

  • Published 18 Nov 2024

COMMENTS • 33

  • @alanscott9258 · 1 year ago +3

    Kasper, just working through your tutorial this week and it is excellent. It has obviously been some time since you did the video, and the coding on IMDb has changed. For example, the CSS selectors now have different names, which just makes it a bit more challenging and interesting. Thanks for doing this.

    • @kasperwelbers · 1 year ago

      Thanks, and double thanks for framing my outdated CSS selectors as a learning challenge :). Still, I think I should then update them at least in the document, so (third) thanks for the heads up!

  • @moviezone8130 · 6 months ago

    Kasper, I found it very helpful, it was a great video and you set the bar high. Very very informative filled with concepts.

  • @mindandresearch · 3 months ago

    You should make more and more videos. You explained this on point! With R and everything around it, you will surely be the best, no doubt!

  • @haraldurkarlsson1147 · 2 years ago

    Congratulations! This is an excellent and lucid explanation of how to web scrape with R's rvest. I had no idea it was this simple.

  • @raould2590 · 1 year ago

    Excellent one! Thank you for this! Well structured & explained and very useful!

  • @timmytesla9655 · 1 year ago

    This was really helpful. Thanks for this awesome tutorial.

  • @R0bbie4141 · 1 year ago

    Hey Kasper. Thanks for your free YouTube Premium in an Airbnb in Berlin last week 😅. I logged out for you when I went home. 👍🏻

  • @Kinglium · 2 years ago +1

    thank you so much for all your hard work! I learned a lot from this video!!

  • @Quienescribiohoy · 2 years ago

    Thank you for this video, it was really helpful.

  • @pieracelis6862 · 7 months ago

    Really good tutorial, thanks a lot!! :)

  • @Ryan-vc9gc · 1 year ago

    Awesome video thank you

  • @hectormercedes6553 · 2 years ago +1

    THANK YOU TEACHER, VERY IMPORTANT, IM NEWBIE

  • @harutyunhakobyan4534 · 2 years ago

    Thank you very much, very helpful

  • @haraldurkarlsson1147 · 1 year ago

    Again thanks for the fine presentation. How about xpath? Have you considered covering that? I was hoping that would help with a table I was scraping but I could not figure out what to hang my hat on. The website is very unusual. You can view a table (the one I would like to scrape) but the code returns a list of three tables not one. What is annoying is that the html code has no distinct tags or marks to work with.

    • @kasperwelbers · 1 year ago

      Hi Haraldur. It's true that xpath is a bit more flexible, so that might help address your problem. But you can also get quite creative with CSS selectors. If there are no distinct tags/ids/classes or whatever for the specific element you want to target, the only way might be to target the closest parent, and then traverse down the children based on tags and their positions. For instance, something like: "#top-nav > div > nav > ul > li:nth-of-type(2)".
      What can help with those types of annoyingly long paths is to use something like the Google Chrome "SelectorGadget" plugin (which I didn't know existed when I made the video). This lets you select an element on a page and it gives you either the CSS selector or XPath.
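The parent-to-child traversal described above can be tried without a live site by parsing a small HTML string with rvest's minimal_html; the fragment and selector below are invented for illustration:

```r
library(rvest)

# Invented fragment: the target <li> has no id or class of its own
page <- minimal_html('
  <div id="top-nav"><div><nav><ul>
    <li>Home</li><li>Movies</li><li>About</li>
  </ul></nav></div></div>
')

# Start from the closest parent with an id, then walk down by tag and position
page |>
  html_element("#top-nav > div > nav > ul > li:nth-of-type(2)") |>
  html_text2()
#> [1] "Movies"
```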

    • @haraldurkarlsson1147 · 1 year ago

      @@kasperwelbers
      Kasper,
      Thanks for the information. It clearly takes a lot of experimenting. I settled on these two code options to extract the third table:
      html_doc |>
        html_elements("table") |>
        html_table(header = TRUE) |>
        pluck(3)
      pluck is from the purrr package (pull will not work here).
      Or using xpath:
      html_doc |>
        html_elements(xpath = '//center[position() = 3]/table') |>
        html_table(header = TRUE)
      The pluck method is more elegant in my mind, but xpath is clearly worth learning at some point.
      By the way, I am using the native pipe, which will not always work, but the regular magrittr pipe will.
      H

    • @kasperwelbers · 1 year ago

      @@haraldurkarlsson1147 pluck indeed offers a nice solution here!
      There is certainly some value in learning xpath, as it's more flexible and also works well for XML files. That said, when doing it in R I tend to prefer your first solution, because it's easier to debug. The xpath approach is probably slightly faster, but in web scraping the main speedbump is the http requests, so in practice the difference in speed would hardly be noticeable.
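The two approaches from this thread can be compared side by side on a mock page built with rvest's minimal_html, so no http request is involved (the three tables are invented to mirror the situation described above):

```r
library(rvest)
library(purrr)

# Invented page with three tables wrapped in <center>, as on the site discussed
html_doc <- minimal_html('
  <center><table><tr><th>a</th></tr><tr><td>1</td></tr></table></center>
  <center><table><tr><th>b</th></tr><tr><td>2</td></tr></table></center>
  <center><table><tr><th>c</th></tr><tr><td>3</td></tr></table></center>
')

# Option 1: parse all tables, then pluck the third
html_doc |>
  html_elements("table") |>
  html_table(header = TRUE) |>
  pluck(3)

# Option 2: target the third <center> with xpath, then parse only that table
html_doc |>
  html_elements(xpath = "//center[position() = 3]/table") |>
  html_table(header = TRUE)
```

Both isolate the third table (the xpath version returns it inside a one-element list); the first is easier to debug step by step, which is the point made above.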

  • @Aguaires · 6 months ago

    Thank you!

  • @thomasberthelot9187 · 2 years ago

    Hi! I'm a newbie and when I run the 7th and 8th lines, I get: "Error in read_html(url) : could not find function "read_html"". Could you please tell me what's wrong? It's the same when I run "%>%", I get: "Error in read_html(url) %>% html_element("table.wikitable") :
    could not find function "%>%"". Same with library(tidyverse).

    • @thomasberthelot9187 · 2 years ago

      when I run the 7th and 8th lines*

    • @thomasberthelot9187 · 2 years ago

      I had already run "# install.packages("tidyverse")"

    • @thomasberthelot9187 · 2 years ago

      Never mind, it worked because I installed it directly from the "Install" button in "Packages"!
      I'm happy
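For later readers hitting the same errors: "could not find function" almost always means the package is installed but not loaded into the current session with library(). A minimal sketch (the commented scraping lines use a placeholder url, not the tutorial's actual page):

```r
# install.packages("tidyverse")  # one-time install; the RStudio "Install"
# install.packages("rvest")      # button in the Packages pane does the same

library(rvest)      # provides read_html(), html_element(), ...
library(tidyverse)  # attaches the %>% pipe, among other things

# url <- "..."      # placeholder: the page you want to scrape
# read_html(url) %>%
#   html_element("table.wikitable")
```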

  • @haraldurkarlsson1147 · 2 years ago

    Very useful! But my issue is not with the R code but rather with reading through the html code and finding the right places. I had a heck of a time. For one thing, the font was tiny and the code enormously long. Is it searchable?

    • @kasperwelbers · 2 years ago +2

      Dear Haraldur, the hardest part of scraping is indeed not so much the code, but learning how to find and select HTML elements. The good news, though, is that this process is the same regardless of what scraping software you use. So if at some point you're using Python, it still applies, and there are also tools for automating an actual web browser, such as Selenium (and RSelenium), for which the main task is also finding and selecting HTML elements. That being said, there are great tools for searching through the HTML code. The main one is the Inspect option, as discussed in the tutorial. If you use the one in Chrome, you can search for both strings and css-selectors, so that's a great way to find elements and figure out how to select them with rvest. Also, note that if you right-click an element on the webpage and select Inspect, it automatically shows the HTML code for this element.

    • @haraldurkarlsson1147 · 2 years ago

      @@kasperwelbers
      Kasper, I have indeed used Inspect in Chrome. My main problem is that the font is so small that I have a hard time reading it.

    • @kasperwelbers · 2 years ago +2

      @@haraldurkarlsson1147 Ahhh, like that! You should be able to change the font size like any content on a webpage. In Chrome, at least for me, it works by holding Ctrl and then scrolling up/down with the mouse. This changes the font size for whatever window you're pointing at.

  • @brittnyfreeman3650 · 1 year ago

    Where is the html tutorial link that you mentioned? It’s not in the description of the video.

    • @kasperwelbers · 1 year ago

      Hi Brittny, you're right. I replaced the HTML file with a .md file (the rmarkdown file is already knitted), because somehow links to the html file on GitHub didn't work. Did you need an html version in particular?

  • @erolarmstrong · 2 years ago

    html_element('table.wikitable')
    Error in UseMethod("xml_find_first") :
    no applicable method for 'xml_find_first' applied to an object of class "character"
    I am getting this error while searching for the html node

    • @kasperwelbers · 2 years ago +1

      Hi Erol. I suspect you are now calling html_element(...) by itself, and not within a pipe.
      The first argument of html_element should be an html page, which you create with read_html. So that would look like:
      html_page = read_html("some url")
      html_element(html_page, ".wikitable")
      But the pipe operator allows us to write it like this:
      read_html("some url") %>%
        html_element(".wikitable")
      In this case the output of read_html is plugged into html_element as the first argument.
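The equivalence described above can be checked on a tiny in-memory page built with rvest's minimal_html, so no url is needed:

```r
library(rvest)

# One-row table carrying the class used in the tutorial
html_page <- minimal_html('<table class="wikitable"><tr><td>x</td></tr></table>')

# Direct call: the parsed page is the first argument
a <- html_element(html_page, ".wikitable")

# Pipe: the left-hand side is plugged in as the first argument
b <- html_page %>% html_element(".wikitable")

html_name(a)  # both select the same <table> node
html_name(b)
```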