Web Scraping with Python - Get URLs, Extract Data
- Published 17 Oct 2023
- Join the Discord to discuss all things Python and Web with our growing community!
This is the third video in the series of scraping data for beginners. We're going to add functionality to scrape from the actual product pages rather than just the search page. Adding in dataclasses will also help us handle our data.
This is a series so make sure you subscribe to get the remaining episodes as they are released!
If you are new, welcome! I am John, a self-taught Python (and Go, kinda..) developer working in the web and data space. I specialize in data extraction and JSON web APIs, both server and client. If you like programming and web content as much as I do, you can subscribe for weekly content.
:: Links ::
Recommended Scraper API www.scrapingbee.com/?fpr=jhnwr
My patrons really keep the channel alive, and get extra content / johnwatsonrooney (NEW free tier)
I host almost all my stuff on Digital Ocean m.do.co/c/c7c90f161ff6
A rundown of the gear I use to create videos www.amazon.co.uk/shop/johnwat...
:: Disclaimer ::
Some/all of the links above are affiliate links. By clicking on these links I receive a small commission should you choose to purchase any services or items. - Science & Technology
Thank you please continue this series
there's at least one more
Excellent video series, much appreciated. Thank you for posting.
John, you've made me re-enjoy scraping. I gave up due to how frustrating most tutorials are and the lack of real-world application with all of those stupid scraping demo sites. Thanks for all you do man
Another great presentation! Neat use of kwargs. Also, a very relevant use of data classes.
Excellent video, great learning experience
Thank you! Cheers!
This is very helpful! I appreciate it a lot.
you are genius man, thank you very much
Thank you very much big John!
You are very welcome
thank you! we need more of this sh!t
and I hope for a series like this on BeautifulSoup too
"parse_page(html)" from lesson 2 suddenly became "parse_search_page(html: HTMLParser):" in lesson 3 without any explanation. Anyway great tutorial as well as a whole series. Very good for beginners.
very very good
Hi, kindly make a video on Python with Selenium, because there's no updated ChromeDriver available, so I don't know how to run my script now.
Thanks
From this point on the video is not understandable for beginners, since you decided for some reason to change all the code
if we can combine playwright with this, then basically we can scrape any dynamic sites? (e.g: social media websites)
thank you so much John this series is very fulfilling.
Essentially yes. This is why I separate out the parsing from the request; dropping Playwright or Selenium in is easy
Great video! Question: How can I find the extension that provides you with the errors next to the code?
Ohayou ❤
Man, your videos are great. Your videos on Playwright have really been helpful. I was able to follow your videos and then make my own Playwright script in my project, until I got stuck dealing with dynamic pop-ups. I am unable to get past those. I am supposed to enter a piece of data in those pop-ups (not like captcha stuff), but I just can't make it work. It would help if you could cover dealing with dynamic pop-ups. Thanks.
Nice job! Is there a way to put this whole thing in a cron job or scheduler to run intermittently?
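One way to answer the scheduling question above, sketched with only the standard library: loop and sleep inside the script itself. For anything serious, a crontab entry (e.g. `0 * * * * python3 scrape.py`, a hypothetical path) is usually the more robust choice, since cron survives crashes and reboots.

```python
import time

def run_every(seconds: float, job, max_runs: int) -> None:
    # Call job() max_runs times, sleeping between runs.
    # A cron job or systemd timer is the sturdier production alternative.
    for i in range(max_runs):
        job()
        if i < max_runs - 1:
            time.sleep(seconds)
```

For example, `run_every(3600, scrape, max_runs=24)` would run a hypothetical `scrape` function hourly for a day.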
Good series! Personally I think the yield is a nice touch but probably not needed here based on the weight of the script (and the generator itself doesn't help iteration as was described as the reason for its inclusion), the dataclass is overkill vs a dict (we end up converting out to dict anyway), and so is **kwargs vs a single kwarg that defaults to something like False or None (gives an impression there may be more than a single kwarg, easier just to use a single one that defaults to a value when not passed in). Got a subscribe from me, thank you :)
Thanks - all valid points, I think I was guilty of trying to shove as many things that you can use into a script that doesn’t need them, for demonstration purposes
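The trade-offs discussed in this exchange can be sketched briefly. This is a hypothetical illustration (the names are not from the video): a dataclass gives typed fields and still converts to a dict via `asdict` when needed, and a single keyword argument with a default replaces an open-ended `**kwargs` signature.

```python
from dataclasses import dataclass, asdict

@dataclass
class Product:
    title: str
    price: str  # kept as a string until we decide how to handle it

def parse_item(title: str, price: str, save: bool = False) -> dict:
    # One explicit flag with a default, instead of **kwargs:
    # callers can see exactly what options exist.
    item = Product(title=title, price=price)
    if save:
        pass  # e.g. append to a CSV here
    return asdict(item)
```

Whether the dataclass is worth it over a plain dict mostly comes down to how many places the record is passed around and whether you want type checking on its fields.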
Can you show how we can do this on websites where we have to log in first?
Also kindly add a product URLs column for each product and make it clickable when writing to CSV
Based on one of your previous videos figured out, how to get nested objects from tricky div's . Thank you!
Could you please advise, how in the function below do I get not only <p> elements but also <h2>, <pre> and <ul>/<li> elements?
Should it be some sort of pipe-like syntax, "div.article-formatted-body > div > p | h2 | pre | ul | li"?
def read_article(html):
    article_body = html.css("div.article-formatted-body > div > p")
    paragraphs = [i.text() for i in article_body]
    print(*paragraphs, sep='\n')
Hi John, what is the fastest scraper for a webpage with dynamically loaded content? I am using Selenium and find it very slow. Any other options?
Great video! You've got a subscriber. After trying out the code a couple of times, I came across ReadTimeout error. How do we fix that?
Beautiful job. How can I find the code?
Shouldn't the item number be an integer and the price be a float?
ideally you want Decimal for price. I tend to leave them as strings until I know how I want to handle them
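A small sketch of the reply above: keep the scraped price as a string, then convert to Decimal (not float, which has binary rounding issues with money) only when you need arithmetic. The cleaning steps here are illustrative assumptions about what a scraped price string might contain.

```python
from decimal import Decimal

def clean_price(raw: str) -> Decimal:
    # Strip currency symbols, thousands separators, and whitespace
    # before converting; Decimal preserves exact cents, unlike float.
    cleaned = raw.replace("$", "").replace("£", "").replace(",", "").strip()
    return Decimal(cleaned)
```

For example, `clean_price("£1,299.99")` yields an exact `Decimal("1299.99")` that sums correctly across thousands of rows.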
thanks heaps for these John, can we please get the code into a pastebin or something pls? 🙏
can you stop smashing your keyboard