Automated Web Scraping in R Part 1 | Writing your Script using rvest

  • Published 2 Feb 2025

COMMENTS • 42

  • @ukuk9162
    @ukuk9162 5 years ago +24

    your voice makes me feel like I'm on board an airplane listening to the hostess

    • @11hamma
      @11hamma 4 years ago

      honestly man

  • @victorsingam3238
    @victorsingam3238 9 months ago +1

    Thank you this was a really good video, easy to follow and well paced.

  • @neguinerezaii3221
    @neguinerezaii3221 2 years ago +1

    This is a great video. I now know how to get data from one wikipedia page. Is there a way to extract all text from all wikipedia pages?

  • @ayaabdelghany4404
    @ayaabdelghany4404 2 years ago +1

    You make it look very easy 😅

  • @moeshyassin
    @moeshyassin 5 years ago +1

    Thank you very much for the nice video. Is there a package that can beautify the email contents so that they appear in a formatted structure?

  • @agustinblacker1324
    @agustinblacker1324 6 years ago +2

    Is there a video about automated scraping in Python? The first scraping video was about Python and was really useful and awesome. Thanks for being so clear and informative. Keep rocking!

    • @Datasciencedojo
      @Datasciencedojo  5 years ago

      Thanks! That is something we might put together in the future! Our free Python web scraping tutorial is here if you need it: ua-cam.com/video/XQgXKtPSzUI/v-deo.html
      Rebecca

  • @vitordeholandajo156
    @vitordeholandajo156 5 years ago +2

    Amazing job.

  • @kolawolekushimo
    @kolawolekushimo 3 years ago

    If you are joining the datetimes, say when not all are visible, what are you supposed to join on?

  • @AbhijeetSinghs
    @AbhijeetSinghs 3 years ago

    Please make a video on clicking a button programmatically on a website using R for data extraction/scraping purposes.

  • @shilpasuresh641
    @shilpasuresh641 4 years ago

    Hi, I have 52,000 URLs and I need to create a search engine so that when users search for their question they get an answer. How do I do that? I even have a JSON file. This should be done using R. If this is possible, I can be in touch with you about it.

  • @svaughn8891
    @svaughn8891 4 years ago

    Hi, I like your video.
    I copied the code from your code repository, but I get this error:
    > # Create a dataframe containing the urls of the web
    > # pages and their converted datetimes
    > marketwatch_webpgs_datetimes

    • @svaughn8891
      @svaughn8891 4 years ago

      I went back through your video and at 5:01 there are some lines that create the urls on screen:
      urls <- wbpg %>% # 'wbpg' is the parsed page object
      html_nodes("div.searchresult a") %>% #See HTML source code for data within this tag
      html_attr("href")
      however, these are not in the current version of r_web_scraping_coded_example_share.R in your code repository.

  • @Austin-wh4yi
    @Austin-wh4yi 5 years ago +1

    Hi so when I run this marketwatch_webpgs_datetimes

    • @Datasciencedojo
      @Datasciencedojo  5 years ago

      Hey there! This could likely be due to datetimes being tagged under "div.deemphasized span.invisible" during certain times of the day. I briefly went over this in the video, but to simplify this, it is in the full script linked below the video (see code.datasciencedojo.com):
      # Grab all datetimes on the page
      datetime <- wbpg %>% # 'wbpg' is the parsed page object
      html_nodes("div.deemphasized span") %>%
      html_text()
      datetime
      # Filter datetimes that do not follow a consistent format
      datetime2
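The selector logic in the reply above can be tried as a small self-contained sketch. This is not the tutorial's exact script: the HTML below is a made-up stand-in for a MarketWatch search page (so no network call is needed), the object name `wbpg` is illustrative, and the format filter uses a simple "contains a clock time" regex as one plausible way to drop inconsistent entries.

```r
library(rvest)

# Stand-in for read_html(<search results url>): a tiny page with the
# structure the reply describes (illustrative, not MarketWatch's real HTML)
wbpg <- read_html('<html><body>
  <div class="deemphasized"><span>4:30 p.m. Jan. 7, 2019</span></div>
  <div class="deemphasized"><span class="invisible">Jan. 7, 2019</span></div>
</body></html>')

# Grab all datetimes on the page, including the ones tagged
# "div.deemphasized span.invisible" at certain times of day
datetime <- wbpg %>%
  html_nodes("div.deemphasized span") %>%
  html_text()

# Filter datetimes that do not follow a consistent format:
# keep only strings containing a clock time like "4:30"
datetime2 <- datetime[grepl("[0-9]{1,2}:[0-9]{2}", datetime)]
```

Against a live page, only the `read_html()` line changes; the two pipeline steps stay the same.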

    • @Austin-wh4yi
      @Austin-wh4yi 5 years ago +1

      @@Datasciencedojo thanks for the prompt and detailed answer.

  • @giuliko
    @giuliko 6 years ago +2

    What an awesome video! Congrats, and keep up the hard work. Hope to see more web scraping videos from you. Great, great video. Thanks a lot.

    • @rebeccamerrett6536
      @rebeccamerrett6536 6 years ago +2

      Thanks, Giuliko! Glad you found it useful. Part 2 is yet to come! Soon!

    • @giuliko
      @giuliko 6 years ago +2

      @@rebeccamerrett6536 I'm looking forward to watching it. You are by far my favorite R channel on YouTube. Thanks a lot once again.

    • @rebeccamerrett6536
      @rebeccamerrett6536 6 years ago +1

      @@giuliko Thank you! It means a lot, and encourages me to keep going :)

  • @alisaja11
    @alisaja11 4 years ago

    Hi, thank you so much for the nice video. I am new to this field and this video is absolutely helpful for a beginner like me. However, when I run your code in the part that loops over titles and bodies, I get an error message saying that the article doesn't exist. Can you help me figure out what could be the cause?

    • @주해람-d1b
      @주해람-d1b 4 years ago

      I'm having the same problem :( Have you solved it by any chance? Thanks in advance.

  • @ssisteluguharish1305
    @ssisteluguharish1305 4 years ago

    awesome

  • @winnie_the_poohh
    @winnie_the_poohh 5 years ago

    When I run the code below, new columns named Title and Body are not added to marketwatch_latest_data. Even when I copy your code and run it, it still does not work. What could be the problem?
    marketwatch_latest_data$Title

    • @michellelai6529
      @michellelai6529 5 years ago

      Thanks for such a clear step-by-step tutorial. I've gotten quite far in, but have faced the same issue as Mickey, where
      names(marketwatch_latest_data) results in [1] "webPg" "DateTime" "DiffHours" only.
      Would you be able to help? Thank you in advance.

    • @Datasciencedojo
      @Datasciencedojo  5 years ago

      Hey folks! Glad you are following along :)
      Here's what could be happening with your problem.
      It could be that no data was published within an hour of the timeframe you have specified here:
      # Filter rows of the dataframe that contain
      # DiffHours of less than an hour
      marketwatch_latest_data
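The filtering step the reply points at can be illustrated with a self-contained sketch. The column names (webPg, DateTime, DiffHours) follow the video, but the data and timestamps below are made up; the one-hour cutoff is the one the reply discusses.

```r
# Illustrative data in the shape of the tutorial's dataframe; values are made up
now <- as.POSIXct("2019-01-07 10:00:00", tz = "UTC")
published <- as.POSIXct(c("2019-01-07 09:30:00",   # 0.5 hours old
                          "2019-01-07 06:00:00"),  # 4 hours old
                        tz = "UTC")

marketwatch_latest_data <- data.frame(
  webPg = c("https://example.com/article1", "https://example.com/article2"),
  DateTime = published,
  DiffHours = as.numeric(difftime(now, published, units = "hours")),
  stringsAsFactors = FALSE
)

# Filter rows of the dataframe that contain DiffHours of less than an hour.
# If nothing was published in that window, this is an empty dataframe, which
# is why later steps (adding Title/Body columns) can appear to do nothing.
latest <- marketwatch_latest_data[marketwatch_latest_data$DiffHours < 1, ]
```

So if `nrow(latest)` is 0 when you run the script, widening the time window (or running it just after articles are published) is the first thing to try.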

    • @Datasciencedojo
      @Datasciencedojo  5 years ago

      @@michellelai6529 Glad you are following along :)
      Here's what could be happening with your problem.
      It could be that no data was published within an hour of the timeframe you have specified here:
      # Filter rows of the dataframe that contain
      # DiffHours of less than an hour
      marketwatch_latest_data

  • @paulh1720
    @paulh1720 5 years ago +1

    thanks !!!!!!!

  • @pratyushak4921
    @pratyushak4921 5 years ago

    I have tried to send the mail but it is showing an authentication error. Any help?

    • @rebeccamerrett6536
      @rebeccamerrett6536 5 years ago

      Mind sharing the error message? Just checking, are you using Gmail?
      Sometimes Gmail blocks sign-ins from less secure apps. Enable 'Allow less secure apps' in your Gmail account. You might want to set up a separate email account for this so you don't compromise the security of your personal Gmail account.
      Or, you could try setting up SMTP in your Gmail account settings.
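For reference, one common way to send mail from R over Gmail SMTP (as discussed above) is the mailR package. This is a hedged sketch, not the tutorial's script: all addresses and the password below are placeholders, mailR requires Java via rJava, and the Gmail account needs SMTP access enabled (e.g. an app password).

```r
library(mailR)

# Send a plain-text email through Gmail's SMTP server.
# from/to/passwd are placeholders; use a dedicated account, not your personal one.
send.mail(
  from = "scraper.bot@gmail.com",
  to = "you@example.com",
  subject = "Latest MarketWatch articles",
  body = "See the latest scrape results.",
  smtp = list(host.name = "smtp.gmail.com", port = 465,
              user.name = "scraper.bot@gmail.com",
              passwd = "app-password-here", ssl = TRUE),
  authenticate = TRUE,
  send = TRUE
)
```

An authentication error here usually means the credentials were rejected, which is consistent with the 'less secure apps' setting mentioned above.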

  • @maktech3936
    @maktech3936 6 years ago +3

    Her voice is soooooooooooooooooooooooooooooooooo pleasing..
    **cough cough
    I meant nice tutorial ❤️

  • @samb.6425
    @samb.6425 4 years ago

    your way of speaking is very stressful