Modern HTML Scraping with Python's BEST Tools

  • Published 13 May 2023
  • There are still plenty of modern sites that serve plain HTML and can be scraped using simple methods. In this video I code a complete web scraping project from scratch, up to saving the data. I use dataclasses, handle responses, use urljoin, and scrape detail pages and pagination.
    Scraper API www.scrapingbee.com/?fpr=jhnwr
    Patreon: / johnwatsonrooney
    Donations: www.paypal.com/donate/?hosted...
    Hosting: Digital Ocean: m.do.co/c/c7c90f161ff6
    Gear I use: www.amazon.co.uk/shop/johnwat...
  • Science & Technology

COMMENTS • 57

  • @sdriding
    @sdriding 1 year ago +19

    Don't think I ever did this so it's well overdue... You helped me get a job as a software engineer. I used things I learned from your vids to make a project that was instrumental in getting a job offer. Thank you so much, you changed the financial trajectory of my whole family! (for others looking for the same, a major contributor in standing out is having an AWS cert)

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago +5

      Thank you, that's amazing. The reason I do this is to help people and it's great to hear! Congratulations on your job!

    • @IwoGda
      @IwoGda 10 months ago

      What AWS cert is the best?

    • @sdriding
      @sdriding 10 months ago

      @IwoGda probably Developer Associate

  • @runnrnr
    @runnrnr 1 year ago +4

    Thank you for your videos! I now link them to people who ask me questions about selectolax. I'm the author of selectolax.

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago

      Oh cool, thank you! Selectolax is great, I use it all the time. Appreciate your work!

    • @user-ro2vo4lq1g
      @user-ro2vo4lq1g 14 days ago

      You should write a better manual; it is very poorly documented.

  • @Kicsa
    @Kicsa 10 months ago +1

    I have been enjoying your good videos, thank you for everything. I hope in a couple of weeks, I can start making my own programs.

  • @TheJFMR
    @TheJFMR 1 year ago +10

    John, it would be nice if you made a video on how to apply unit testing or test-driven development to a web scraping project 😉
    You are a good teacher for that.

  • @ManuelGonzales-ni9sh
    @ManuelGonzales-ni9sh 1 year ago +10

    Great tutorial John! Would you please consider doing a full tutorial on your nvim theme & config?

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago +8

      Thanks! Yes, I will do a video on my nvim. I’ve been configuring it a little more recently and will share soon.

  • @adarshjamwal3448
    @adarshjamwal3448 1 year ago +4

    Awesome👍👍 tutorial.
    I learned a lot of things from your scraping series. Keep going on.

  • @amarAK47khan
    @amarAK47khan 8 months ago +1

    You are a lifesaver!

  • @yacinehechmi6012
    @yacinehechmi6012 1 year ago +1

    Greetings from Tunisia. Thanks John! Waiting for that nvim video; I would really love to know what you configured in nvim for Python development.

  • @samoylov1973
    @samoylov1973 1 year ago +1

    Set comprehension is a nice touch in this video. While watching, I thought of converting to a set afterwards, but doing it in one easy step, as you did, is better.
    One wish: when you explain parts such as "When you want to grab all these table information..." (20:19 in the video), please show at least one of them through to the end. We will figure out how to do the others :)
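For readers who have not seen the video, the set-comprehension idea mentioned above might look roughly like the sketch below. It is not the code from the video: the base URL and the hrefs are made-up placeholders standing in for links scraped from a listing page. It combines `urljoin` (named in the video description) with a set comprehension so that absolute URLs are built and duplicates dropped in one step.

```python
from urllib.parse import urljoin

# Hypothetical base URL and hrefs, standing in for links scraped from a listing page.
BASE_URL = "https://example.com/c/backpacks"
hrefs = [
    "/product/111/day-pack",
    "/product/222/trail-pack",
    "/product/111/day-pack",  # the same product linked twice on the page
]

# Build absolute URLs and drop duplicates in a single step.
product_urls = {urljoin(BASE_URL, href) for href in hrefs}

# Sets are unordered, so sort before printing for a stable listing.
for url in sorted(product_urls):
    print(url)
```

The alternative the commenter first had in mind, building a list and calling `set()` on it afterwards, gives the same result with one extra pass.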

  • @anthonymunnelly20
    @anthonymunnelly20 1 year ago +1

    Excellent. Really, really well-done tutorial on a subject that seems straightforward, but isn't.

  • @charlescharles4279
    @charlescharles4279 1 year ago

    Awesome tutorial, do you notice any performance drop when using dataclass to save data during web scraping compared to using dicts?

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago +1

      Thanks! Generally no, the time lost in scraping is in the network connections so I’ve never worried about it much

  • @valuetraveler2026
    @valuetraveler2026 1 year ago +2

    Good to see alternatives for parsing (selectolax). Will use rich from now on. I don't personally like to use dataclass/pydantic for most work as it has hundreds of fields, but this is cleaner code than imperative style down the page.

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago +1

      I really like selectolax. And fair enough regarding dataclasses - for me at the moment the benefits outweigh the downsides

  • @malwaredev33
    @malwaredev33 1 year ago +1

    Excellent video content, all videos are understandable for anyone. Can you tell me what font/theme you're using in VS Code in this video? Thanks

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago +1

      Thanks! Editor is Neovim and colour scheme is called oxocarbon

  • @rz84vlog78
    @rz84vlog78 1 year ago +1

    The tutorial really helped me. Is it possible to scrape a website like College Board, since basic authentication with a username and password doesn’t seem to work? I would love to at least get some tips so that I can scrape slightly more complex websites.

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago

      Hey thanks, glad it helped. For websites that need a login I generally lean towards browser automation (Playwright), simply because it is much quicker and easier to get something working. I’d suggest that if you haven’t looked into it already; there are a few videos on my channel that could help.

  • @flashwade888
    @flashwade888 1 year ago +1

    Thank you so much for the detailed tutorial, John!
    I have a quick question - would it be possible to use dataclasses with Scrapy, please?

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago +2

      thanks glad you liked it! yes you can use dataclasses with scrapy since 2.2

    • @flashwade888
      @flashwade888 1 year ago

      @JohnWatsonRooney Cheers! I cannot wait to give it a go!

  • @DrChrisCopeland
    @DrChrisCopeland 1 year ago +1

    I have learned a lot from your videos. Can you do any type of tutorial on report generation for the scrapes. My main use case is once I identify a page that meets my requirements, I generate a PDF (or something) that would show the page as it was. I've had terrible luck with htmltopdf and similar libraries (or point me in the right direction). Thanks for what you do!

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago

      Are you after just a visual representation of the page? Playwright can do that very easily. Or are you grabbing data and want that in a PDF? Sorry, not quite sure what you mean!

    • @DrChrisCopeland
      @DrChrisCopeland 1 year ago

      @JohnWatsonRooney Visual representation as far as I can tell (use case is still in the works/fluid). Once an item/listing on the page meets a requirement, save that individual info to a PDF, run some more stuff, then on to the next item/listing. Due to the subject matter, I don't want to put more in the comments, but yeah I'm learning a lot here and it's all going to work on a non-profit I run in the US.

    • @DrChrisCopeland
      @DrChrisCopeland 1 year ago

      @JohnWatsonRooney I will look at Playwright as well!

  • @codetechpro
    @codetechpro 11 months ago +1

    Hey John, I was wondering, is it possible to fill out a dynamic Visa card form with Selenium or Playwright?

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago +1

      I don’t know that one specifically, but I’ve filled out loads of forms with Playwright and Selenium before. If it loads the page fine, you’ll have access to the forms to fill out the data.

  • @tm_Panda...
    @tm_Panda... 1 year ago +1

    Hey, I was wondering why you stopped using Scrapy? Was it too big of a framework for the scraping projects you do?
    Great video as always!

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago +1

      I found that I preferred to write my own solutions from the ground up for what I was trying to do. Scrapy is still a great framework though. I have a video on my channel about it if you are interested in more details.

  • @michakuczma4076
    @michakuczma4076 1 year ago +1

    Is this the M+ 1M font you use in your IDE? Very nice and readable.

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago

      Yes it is- although I think it’s m+ 2m. It’s great I’ve been using it for a while now

  • @mxdigitalmediamarketplace
    @mxdigitalmediamarketplace 5 months ago

    Hello, I'm a newbie at scraping. When I wrote @Dataclass it did not let me do it; it says it is not an integer. I'm using Python 3.12, httpx, selectolax and rich, as you mentioned in the tutorial.
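The exact cause of that error cannot be determined from the comment alone, but one thing worth checking is the decorator's spelling and import: the standard-library name is lowercase `dataclass`, imported from the `dataclasses` module. The sketch below is a minimal illustration; the `Product` class and its fields are hypothetical and not necessarily the ones used in the video.

```python
from dataclasses import dataclass, asdict
from typing import Optional


# The decorator is lowercase and must be imported first:
# @dataclass, not @Dataclass.
@dataclass
class Product:
    # Hypothetical fields, chosen only for illustration.
    name: str
    sku: str
    price: Optional[str] = None  # may be missing on some pages


item = Product(name="Day Pack", sku="111")

# asdict() converts the instance to a plain dict, handy for CSV or JSON output.
print(asdict(item))
```

If `Dataclass` is used without ever being defined, Python reports a `NameError`, so an "integer" message would suggest a different problem elsewhere in the script.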

  • @Assxz
    @Assxz 1 year ago +1

    hi john, what editor are you using in this video?

  • @DucNguyen-in1xd
    @DucNguyen-in1xd 1 year ago +1

    Can you give an example of selecting by class?

  • @AhmedAl-Yousofi
    @AhmedAl-Yousofi 9 months ago +1

    What editor are you using?

  • @ZhCrypto
    @ZhCrypto 1 year ago

    U are innocent programmer ❤

  • @mxdigitalmediamarketplace
    @mxdigitalmediamarketplace 5 months ago

    Hello, following your tutorial, I am getting an error on line 26
    resp = client.get(url, headers=headers)

    Traceback (most recent call last):
    File "", line 1, in
    resp = client.get(url, headers=headers)
    NameError: name 'client' is not defined

  • @coyoteden8111
    @coyoteden8111 1 year ago +3

    Early morning web scraping lesgo

  • @atatekeli9295
    @atatekeli9295 1 year ago +1

    Hi John, I tried turning your header code into this for macOS
    headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.99 Safari/537.36"
    }
    I use Google Chrome for web scraping on an M1 Mac running macOS Ventura 13.4. How can I make it compatible with my setup?

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago

      Hi, the User-Agent header is what we send with the request to the website. It can be anything: you can use the same one I do or any that you can find on Google. It doesn’t need to match your system.

    • @atatekeli9295
      @atatekeli9295 1 year ago

      @JohnWatsonRooney Would it cause an error if I use the same code even though it is not configured for my system?
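To illustrate the point in the reply above, that the User-Agent is only a string attached to the outgoing request, here is a small sketch. It uses `urllib.request` from the standard library rather than the httpx client from the video so that it runs without extra installs, and it builds the request without sending anything. The URL is a placeholder.

```python
from urllib.request import Request

# Hypothetical target URL; nothing is actually sent in this sketch.
URL = "https://example.com/"

# A macOS Chrome-style User-Agent, as in the comment above. It is only a string
# sent with the request, so it does not have to describe the machine running
# the script.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_1) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/99.0.9999.99 Safari/537.36"
    )
}

request = Request(URL, headers=headers)

# urllib normalises header names with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
```

With httpx the same dictionary would be passed as `headers=headers`, as in the `client.get(url, headers=headers)` call quoted elsewhere in this thread. Python itself never checks the string against the local operating system.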

  • @malwaredev33
    @malwaredev33 1 year ago

    Hi, Bro how are you.?

  • @keifer7813
    @keifer7813 1 year ago +1

    What do you do when the elements you want have dynamically changing classes like class="xJdnxidXjejns xIdhdn39db xzIJhdidmn8"

    • @JohnWatsonRooney
      @JohnWatsonRooney 1 year ago +1

      go back up the element tree until you find one that is constant, then reference off of that. I use css selectors so something like "div.constantclass li a" for all the a tags within li tags in divs with class "constantclass"

    • @ankylosis751
      @ankylosis751 1 year ago

      @JohnWatsonRooney I would really love a tutorial on this... and if you have made something similar on dynamically changing classes, can you link me? I'm at my wits' end. By the way, superb content, man. It's helping me learn Python deeply too.

  • @richardboreiko
    @richardboreiko 11 months ago

    I'm getting an error on page 20 and it's consistent, but the products seem to vary each time the page appears, so they must be getting unordered data from their SQL statement.
    File "C:\Users\richa\AppData\Local\Programs\Python\Python310\lib\ssl.py", line 1132, in read
    return self._sslobj.read(len)
    TimeoutError: The read operation timed out
    It looks like the last line from your code to be executed was this:
    File "C:\Users\richa\PycharmProjects\webScraping\JohnWatsonRooney\ModernScrapingBestTools.py", line 28, in get_page
    resp = client.get(url, headers=headers)
    File "C:\Users\richa\PycharmProjects\webScraping\venv\lib\site-packages\httpx\_transports\default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
    httpx.ReadTimeout: The read operation timed out
    It happens consistently on www.rei.com/c/backpacks?page=20 but the number of products printed seems to vary before the error occurs.
    Do you have any debugging suggestions?
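One common way to cope with an intermittent read timeout like the one above is to retry the request a few times with a short pause in between. The sketch below is a generic, standard-library-only illustration, not code from the video: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the flaky function only simulates a request that times out twice before succeeding.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    retries: int = 3,
    backoff: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
) -> T:
    """Call fetch() up to `retries` times, sleeping longer after each timeout."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the original error
            time.sleep(backoff * attempt)  # linear backoff: 1x, 2x, ...
    raise RuntimeError("unreachable")


# Simulated flaky request: times out twice, then succeeds.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("The read operation timed out")
    return "page 20 HTML"


result = fetch_with_retries(flaky_fetch, retries=3, backoff=0)
print(result, "after", calls["count"], "attempts")
```

With httpx, one might pass something like `lambda: client.get(url, headers=headers, timeout=30.0)` as `fetch` and `retry_on=(httpx.ReadTimeout,)`. httpx applies a default timeout of 5 seconds, so a slow page can plausibly exceed it; whether that is the cause here is an assumption.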