Modern HTML Scraping with Python's BEST Tools
- Published 13 May 2023
- There are still plenty of modern sites that are plain HTML and can be scraped using simple methods. In this video I code a complete web scraping project from scratch, up to saving the data. I use dataclasses, handle responses, use urljoin, and scrape detail pages and pagination.
Scraper API www.scrapingbee.com/?fpr=jhnwr
Patreon: / johnwatsonrooney
Donations: www.paypal.com/donate/?hosted...
Hosting: Digital Ocean: m.do.co/c/c7c90f161ff6
Gear I use: www.amazon.co.uk/shop/johnwat... - Science & Technology
Don't think I ever did this, so it's well overdue... You helped me get a job as a software engineer. I used things I learned from your vids to make a project that was instrumental in getting a job offer. Thank you so much, you changed the financial trajectory of my whole family! (For others looking for the same, a major contributor to standing out is having an AWS cert.)
Thank you, that's amazing. The reason I do this is to help people and it's great to hear! Congratulations on your job!
What AWS cert is the best?
@IwoGda probably developer associate
Thank you for your videos! I now link them to people who ask me questions about selectolax. I'm the author of selectolax.
Oh cool thank you! Selectolax is great I use it all the time - appreciate your work!
You should write a better manual; it is very poorly documented.
I have been enjoying your good videos, thank you for everything. I hope in a couple of weeks, I can start making my own programs.
John, it would be nice if you made a video on how to apply unit testing or test-driven development to a web scraping project 😉
You would be a good teacher for that.
Interesting idea, I’ll add it to my list thanks!
Great tutorial John! Would you please consider doing a full tutorial on your nvim theme & config?
Thanks! Yes, I will do a video on my nvim. I’ve been configuring it a little more recently and will share soon.
Awesome👍👍 tutorial.
I learned a lot of things from your scraping series. Keep going on.
Thank you glad I can help
you are a life saver !
Greetings from Tunisia, thanks John!! Waiting for that nvim video; I would really love to know what you configured in nvim for Python development.
Set Comprehension is a nice touch in this video. While watching, thought of converting to set afterwards. But making it in one and easy go, as you did, is better.
One wish: when you explain parts such as "When you want to grab all these table information..." (at 20:19), please show at least one piece of it through to the end. We can figure out how to do the others.
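The set comprehension mentioned above can be sketched roughly as follows. This is a minimal illustration, not the video's exact code: the base URL and the `hrefs` list are made-up stand-ins for links pulled from a listing page, and `urljoin` turns each relative link into an absolute one while the set drops duplicates in the same pass.

```python
from urllib.parse import urljoin

# Hypothetical listing page URL and relative links scraped from it.
BASE_URL = "https://example.com/c/backpacks"
hrefs = [
    "/product/1/tent",
    "/product/2/pack",
    "/product/1/tent",  # duplicate link, e.g. image and title both link to it
    "/product/2/pack",
]

# Build absolute URLs and de-duplicate in one step.
product_urls = {urljoin(BASE_URL, href) for href in hrefs}

for url in sorted(product_urls):
    print(url)
```

A set does not keep the original order, so sort it (as above) or use a list plus a "seen" set if order matters.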
Excellent. Really, really well-done tutorial on a subject that seems straight-forward, but isn't.
Awesome tutorial, do you notice any performance drop when using dataclass to save data during web scraping compared to using dicts?
Thanks! Generally no, the time lost in scraping is in the network connections so I’ve never worried about it much
Good to see alternatives for parsing (selectolax). Will use rich from now on. I don't personally like to use dataclass/pydantic for most work, as it has hundreds of fields. But this is cleaner code than an imperative style down the page.
I really like selectolax. And fair enough regarding dataclasses - for me at the moment the benefits outweigh the downsides
Excellent video content, all videos are understandable for anyone. Can you tell me what font/theme you're using in VS Code in this video? Thanks
Thanks! Editor is Neovim and colour scheme is called oxocarbon
The tutorial really helped me. Is it possible to scrape a website like College Board, since basic authentication with a username and password doesn’t seem to work? I would love to at least get some tips so that I can scrape slightly more complex websites.
Hey, thanks, glad it helped. For websites that need a login I generally lean towards browser automation (Playwright), simply because it is much quicker and easier to get something working. I’d suggest looking into it if you haven’t already; there are a few videos on my channel that could help.
Thank you so much for the detailed tutorial, John!
I have a quick question - would it be possible to use dataclasses with Scrapy, please?
Thanks, glad you liked it! Yes, you can use dataclasses with Scrapy since version 2.2.
@JohnWatsonRooney Cheeeeers!! I cannot wait to give it a go!
I have learned a lot from your videos. Can you do any type of tutorial on report generation for the scrapes? My main use case is that once I identify a page that meets my requirements, I generate a PDF (or something) that shows the page as it was. I've had terrible luck with htmltopdf and similar libraries (or point me in the right direction). Thanks for what you do!
Are you after just a visual representation of the page? Playwright can do that very easily. Or are you grabbing data and want that in a PDF? Sorry, not quite sure what you mean!
@JohnWatsonRooney Visual representation as far as I can tell (use case is still in the works/fluid). Once an item/listing on the page meets a requirement, save that individual info to a PDF, run some more stuff, then on to the next item/listing. Due to the subject matter, I don't want to put more in the comments, but yeah I'm learning a lot here and it's all going to work on a non-profit I run in the US.
@JohnWatsonRooney I will look at Playwright as well!
Hey John I was wondering, is it possible to fill up visa card dynamic form with selenium or playwright?
I don’t know that one specifically, but I’ve filled out loads of forms with Playwright and Selenium before. If the page loads fine, you’ll have access to the forms to fill out the data.
Hey, I was wondering why you stopped using Scrapy? Was it too big of a framework for the scraping projects you do?
Great video as always!
I found that I preferred to write my own solutions from the ground up for what I was trying to do. Scrapy is still a great framework though. I have a video on my channel about it if you are interested in more details.
Is this M+ 1M font you use in your ide? very nice and readable
Yes it is, although I think it’s M+ 2M. It’s great, I’ve been using it for a while now.
Hello, I am a newbie at scraping. When I wrote @Dataclass it did not let me do it; it says it is not an integer. I am using Python 3.12, httpx, selectolax and rich, as you mentioned in the tutorial.
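The exact error message isn't shown, so this is only a guess at one likely cause: the decorator in the standard library is lowercase `dataclass`, imported from `dataclasses`, and each field needs a type annotation. A minimal sketch follows; the `Product` class and its fields are hypothetical examples, not the video's exact definition.

```python
from dataclasses import dataclass, asdict


@dataclass  # lowercase: "Dataclass" is not a name the module provides
class Product:
    # Illustrative fields; scraped values usually arrive as strings.
    name: str
    sku: str
    price: str


item = Product(name="Tent", sku="123", price="99.95")

# asdict() converts the instance to a plain dict, handy for CSV or JSON output.
print(asdict(item))
```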
hi john, what editor are you using in this video?
This is neovim with the oxocarbon theme
can you give example when select by class?
Class is separated by a dot “div.class”
What editor are you using?
Neovim
U are innocent programmer ❤
Hello, following your tutorial, I am getting an error on line 26
resp = client.get(url, headers=headers)
Traceback (most recent call last):
File "", line 1, in
resp = client.get(url, headers=headers)
NameError: name 'client' is not defined
Early morning web scraping lesgo
Hi John, I tried turning your header code into this for macOS
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.99 Safari/537.36"
}
I use Google Chrome for web scraping on an M1 chip with macOS Ventura 13.4. How can I make it compatible for my scraping?
Hi - the user agent header is what we send with the request to the website - it can be anything, you can use the same one I do or any that you can find on google. It doesn’t need to match your system
@JohnWatsonRooney Would it cause an error if I use the same code when it is not configured to match my system?
Hi bro, how are you?
What do you do when the elements you want have dynamically changing classes like class="xJdnxidXjejns xIdhdn39db xzIJhdidmn8"?
Go back up the element tree until you find an element whose class is constant, then reference off of that. I use CSS selectors, so something like "div.constantclass li a" matches all the a tags within li tags in divs with class "constantclass".
@JohnWatsonRooney I would really love a tutorial on this... and if you have made something similar about dynamically changing classes, can you link me? I'm at my wits' end. By the way, superb content, man. It's helping me learn Python deeply too.
I'm getting an error on page 20 and it's consistent, but the products seem to vary each time the page appears, so they must be getting unordered data from their SQL statement.
File "C:\Users\nicha\AppData\Local\Programs\Python\Python310\lib\ssl.py", line 1132, in read
return self._sslobj.read(len)
TimeoutError: The read operation timed out
It looks like the last line from your code to be executed was this:
File "C:\Users\nicha\PycharmProjects\webScraping\JohnWatsonRooney\ModernScrapingBestTools.py", line 28, in get_page
resp = client.get(url, headers=headers)
File "C:\Users\nicha\PycharmProjects\webScraping\venv\lib\site-packages\httpx\_transports\default.py", line 77, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out
It happens consistently on www.rei.com/c/backpacks?page=20 but the number of products printed seems to vary before the error occurs.
Do you have any debugging suggestions?