Web Scraping with Python - Get URLs, Extract Data

  • Published 17 Oct 2023
  • Join the Discord to discuss all things Python and Web with our growing community! / discord
    This is the third video in the series of scraping data for beginners. We're going to add functionality to scrape from the actual product pages rather than just the search page. Adding in dataclasses will also help us handle our data.
    This is a series so make sure you subscribe to get the remaining episodes as they are released!
    If you are new, welcome! I am John, a self-taught Python (and Go, kinda..) developer working in the web and data space. I specialize in data extraction and JSON web APIs, both server and client. If you like programming and web content as much as I do, you can subscribe for weekly content.
    :: Links ::
    Recommended Scraper API www.scrapingbee.com/?fpr=jhnwr
    My Patrons really keep the channel alive, and get extra content / johnwatsonrooney (NEW free tier)
    I host almost all my stuff on Digital Ocean m.do.co/c/c7c90f161ff6
    A rundown of the gear I use to create videos www.amazon.co.uk/shop/johnwat...
    :: Disclaimer ::
    Some/all of the links above are affiliate links. By clicking on these links I receive a small commission should you choose to purchase any services or items.
  • Science & Technology

COMMENTS • 34

  • @hreedaymishra7761
    @hreedaymishra7761 8 months ago +6

    Thank you, please continue this series.

  • @daveys
    @daveys 8 months ago +2

    Excellent video series, much appreciated. Thank you for posting.

  • @0x_nietoh
    @0x_nietoh 6 months ago +1

    John, you've made me re-enjoy scraping. I gave up due to how frustrating most tutorials are and the lack of real-world application with all of those stupid scraping demo sites. Thanks for all you do man

  • @thebuggser2752
    @thebuggser2752 5 months ago +1

    Another great presentation! Neat use of kwargs. Also, a very relevant use of data classes.

  • @eduardop5487
    @eduardop5487 8 months ago +7

    Excellent video, great learning experience

  • @milyastroc
    @milyastroc 8 months ago +1

    This is very helpful! I appreciate it a lot.

  • @abdifatahabdi3939
    @abdifatahabdi3939 8 months ago +1

    You are a genius, man, thank you very much.

  • @Fabricio-mq2uk
    @Fabricio-mq2uk 8 months ago +1

    Thank you very much big John!

  • @Lorem04
    @Lorem04 8 months ago +3

    Thank you! We need more of this sh!t,
    and I hope for a series like this on BeautifulSoup too.

  • @Mac_Edits1
    @Mac_Edits1 7 months ago

    "parse_page(html)" from lesson 2 suddenly became "parse_search_page(html: HTMLParser):" in lesson 3 without any explanation. Anyway great tutorial as well as a whole series. Very good for beginners.

  • @michaelscheider6414
    @michaelscheider6414 8 months ago +1

    very very good

  • @muhammadhaddid9927
    @muhammadhaddid9927 8 months ago +2

    Hi, kindly make a video on Python with Selenium. No updated Chrome driver is available, so I don't know how to run the script now.
    Thanks

  • @user-ro2vo4lq1g
    @user-ro2vo4lq1g 10 days ago +1

    From this video on, the series is not understandable for beginners, because for some reason you decided to change all the code.

  • @AliceShisori
    @AliceShisori 8 months ago +2

    If we can combine Playwright with this, then basically we can scrape any dynamic site (e.g. social media websites)?
    Thank you so much John, this series is very fulfilling.

    • @JohnWatsonRooney
      @JohnWatsonRooney  8 months ago +1

      Essentially yes. This is why I separate the parsing from the request: dropping Playwright or Selenium in is easy.
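
The separation described in the reply above can be sketched as follows. This is a minimal illustration, not the code from the video: the standard-library `html.parser` stands in for whatever parser the series uses so that the example runs without extra packages, and `parse_title`, `scrape_title`, and `fake_fetch` are hypothetical names. The point is only that the parsing function receives HTML and never knows how it was fetched.

```python
from html.parser import HTMLParser
from typing import Callable


class TitleParser(HTMLParser):
    """Collects the text inside the first <h1> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_h1 = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only start collecting if no title has been captured yet.
        if tag == "h1" and not self.title:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data


def parse_title(html: str) -> str:
    """Parsing step: takes HTML text, knows nothing about the request."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def scrape_title(url: str, fetch: Callable[[str], str]) -> str:
    """Glue step: any callable that maps a URL to HTML can be passed in."""
    return parse_title(fetch(url))


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for an HTTP client or a browser-automation call.
    return "<html><body><h1> Example Tent </h1></body></html>"
```

Swapping in a browser-based fetcher would then mean writing another function with the same `url -> html` shape and passing it as `fetch`, leaving `parse_title` untouched.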

  • @juampivitalevi9611
    @juampivitalevi9611 3 months ago

    Great video! Question: How can I find the extension that provides you with the errors next to the code?

  • @bakasenpaidesu
    @bakasenpaidesu 8 months ago +2

    Ohayou ❤

  • @KushalSharmatheOne
    @KushalSharmatheOne 8 months ago +1

    Man, your videos are great. Your videos on Playwright have really been helpful. I was able to follow them and then write my own Playwright script for my project, until I got stuck dealing with dynamic pop-ups. I am unable to get past those. I am supposed to enter a piece of data in those pop-ups (not captcha-type stuff) and just can't make it work. It would help if you could cover dealing with dynamic pop-ups. Thanks.

  • @acharafranklyn5167
    @acharafranklyn5167 7 months ago

    Nice job. Is there a way to put this whole thing in a cron job or scheduler so it runs at intervals?

  • @darylkell12279
    @darylkell12279 6 months ago +1

    Good series! Personally I think the yield is a nice touch but probably not needed here based on the weight of the script (and the generator itself doesn't help iteration as was described as the reason for its inclusion), the dataclass is overkill vs a dict (we end up converting out to dict anyway), and so is **kwargs vs a single kwarg that defaults to something like False or None (gives an impression there may be more than a single kwarg, easier just to use a single one that defaults to a value when not passed in). Got a subscribe from me, thank you :)

    • @JohnWatsonRooney
      @JohnWatsonRooney  6 months ago

      Thanks - all valid points, I think I was guilty of trying to shove as many things that you can use into a script that doesn’t need them, for demonstration purposes
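
To make the `**kwargs` point in this exchange concrete, here is a small sketch comparing the two signatures. The function names, the `strip_currency` option, and the `Product` fields are hypothetical and are not taken from the video's code; the dataclass is included only to show the conversion to a dict that the comment mentions.

```python
from dataclasses import asdict, dataclass


@dataclass
class Product:
    # Hypothetical fields, kept as strings as they would come off the page.
    name: str
    price: str


def clean_text_kwargs(text: str, **kwargs) -> str:
    """Variant with **kwargs: accepted options are invisible in the signature."""
    if kwargs.get("strip_currency"):
        return text.replace("$", "").strip()
    return text.strip()


def clean_text(text: str, strip_currency: bool = False) -> str:
    """Variant with one explicit keyword argument that defaults to False."""
    if strip_currency:
        return text.replace("$", "").strip()
    return text.strip()


# Converting the dataclass to a plain dict, e.g. before writing a CSV row.
row = asdict(Product(name=clean_text(" Tent "), price=clean_text(" $19.99 ", strip_currency=True)))
```

One practical difference: a misspelled option such as `strip_curency=True` is silently ignored by the `**kwargs` variant, while the explicit variant raises a `TypeError`.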

  • @SkullTraill
    @SkullTraill 8 months ago

    Can you show how we can do this on websites where we have to log in first?

  • @jaswanth333
    @jaswanth333 8 months ago

    Also, kindly add a product URL column for each product and make it clickable when writing to CSV.

  • @samoylov1973
    @samoylov1973 8 months ago

    Based on one of your previous videos, I figured out how to get nested objects from tricky divs. Thank you!
    Could you please advise how, in the function below, I can get not only the <p> elements but also the <h2>, <pre>, and <ul> elements?
    Should it be some sort of pipe-like syntax, such as "div.article-formatted-body > div > p | h2 | pre | ul | li |"?
    def read_article(html):
        article_body = html.css("div.article-formatted-body > div > p")
        paragraphs = [i.text() for i in article_body]
        print(*paragraphs, sep='\n')
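
On the question above: standard CSS has no pipe operator, but it does allow a comma-separated selector list, where each entry is a complete selector. A hedged sketch follows; `build_selector` is a hypothetical helper, and it assumes the parser behind `html.css` accepts standard selector lists.

```python
from typing import Iterable


def build_selector(parent: str, tags: Iterable[str]) -> str:
    """Return a CSS selector list matching each tag as a direct child of parent.

    The parent prefix is repeated for every tag, because in a selector list
    such as "div > p, h2" the second entry would match every <h2> on the page.
    """
    return ", ".join(f"{parent} > {tag}" for tag in tags)


selector = build_selector("div.article-formatted-body > div", ["p", "h2", "pre", "ul"])
```

The resulting string could then replace the hard-coded selector in `read_article`, for example `html.css(selector)`.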

  • @bathuudamdin
    @bathuudamdin 8 months ago

    Hi John, what is the fastest scraper for web pages with dynamically loaded content? I am using Selenium and find it very slow. Any other options?

  • @abhin.v4981
    @abhin.v4981 5 months ago

    Great video! You've got a subscriber. After trying out the code a couple of times, I came across a ReadTimeout error. How do we fix that?
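
A read timeout usually means the server did not send data within the client's time limit, so the common remedies are a longer timeout setting and retrying with a pause. The sketch below shows only the retry part in a library-agnostic way. `with_retries` and `flaky_fetch` are hypothetical names, and the built-in `TimeoutError` stands in for whichever timeout exception the HTTP client in use actually raises.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    exceptions: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay: float = 1.0,
) -> T:
    """Call func, retrying on the given exceptions with a growing pause."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # Out of retries: let the caller see the error.
            time.sleep(delay * attempt)  # Wait a little longer each time.


calls = {"count": 0}


def flaky_fetch() -> str:
    # Simulated request that times out twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated read timeout")
    return "<html>ok</html>"
```

In a real script, `func` would wrap the actual request, and `exceptions` would name the client's own timeout class.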

  • @nuno2032
    @nuno2032 8 months ago

    Beautiful job. How can I find the code?

  • @atatekeli9295
    @atatekeli9295 5 months ago +1

    Shouldn't the item number be an integer and the price a float?

    • @JohnWatsonRooney
      @JohnWatsonRooney  5 months ago

      Ideally you want decimal for price. I tend to leave them as a string until I know how I want to handle them.
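
As a sketch of the decimal suggestion, the scraped string can be converted with the standard-library `decimal` module once the handling is decided. `parse_price` is a hypothetical helper, and it assumes prices use a dot as the decimal separator and commas only for thousands.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(raw: str) -> Optional[Decimal]:
    """Convert a scraped price string such as '$1,299.95' to a Decimal.

    Returns None when no usable number is found.
    """
    cleaned = re.sub(r"[^\d.]", "", raw)  # Drop currency symbols and commas.
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None  # e.g. a string like "1.2.3" or a lone "."


price = parse_price("$1,299.95")
```

Unlike binary floats, `Decimal` keeps amounts such as 0.10 and 0.20 exact, so sums and comparisons of prices behave as expected.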

  • @sallycakes472
    @sallycakes472 8 months ago

    thanks heaps for these John, can we please get the code into a pastebin or something pls? 🙏

  • @playboipablo
    @playboipablo 8 months ago

    can you stop smashing your keyboard