Industrial-scale Web Scraping with AI & Proxy Networks

Beyond Fireship

Published Apr 23, 2023
Learn advanced web scraping techniques with Puppeteer and BrightData's scraping browser. We collect ecommerce data from sites like Amazon then analyze that data with ChatGPT.
#javascript #datascience #chatgpt
Get $10 Credit for BrightData get.brightdata.com/fireship
Puppeteer Docs pptr.dev

COMMENTS • 624

@beyondfireship 1 year ago ⁺¹³⁵
Use this link to get a $10 credit, enough cash to scrape thousands of pages get.brightdata.com/fireship
@DeanDavisMarketing 1 year ago ⁺³
❤
@Reddblue 1 year ago ⁺²⁶
This man selling wood and iron to shovel makers
@anze 1 year ago ⁺⁵
@beyondfireship ad link doesn't work
@NoahKalson 1 year ago
@@anze worked for me. Try now.
@tamasmajer 1 year ago ⁺⁴
The pricing page says $20/GB. I checked how big the pricing page itself was and it loaded 4 MB, so does it then cost $20 for 250 pages? That seems very expensive. Or how should I calculate the price?
@rvft 1 year ago ⁺¹¹²¹
I like how he didn't use "cheap" during the entire video, because my god, the pricing on the advertised product is absolute madness
@brunopanizzi 1 year ago ⁺¹⁷⁸
Industrial scale!!!
@koba2160 1 year ago ⁺⁶²
Scraping ain't cheap, but there are many ways to make it much cheaper
@mrgyani 1 year ago ⁺²⁴
@@arteuspw what do you mean by 1gb/$1? You mean browsing 1gb of data for a dollar with a single proxy?
How many proxies do you get for $1?
@user-kj2kt8jt4n 1 year ago ⁺²⁸
@@arteuspw Please tell me where to buy them at this price.
@mantas9827 1 year ago ⁺¹²
Is $20 per GB considered expensive? I wonder how much you could scrape from a site like Amazon for that GB... surely a lot?
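The arithmetic behind the pricing question can be checked in a few lines. This is only a rough sketch: it assumes the $20/GB price and the 4 MB page weight quoted in the thread, treats 1 GB as 1000 MB, and uses a made-up function name. Real page weights vary a lot, so the result is an order-of-magnitude estimate at best.

```javascript
// Rough cost estimate for bandwidth-priced scraping.
// Assumptions: price and page size come from the comments above; 1 GB = 1000 MB.
function pagesPerBudget(pricePerGbUsd, pageSizeMb, budgetUsd) {
  const mbPerGb = 1000;
  // Cost of downloading one page of the given size.
  const costPerPageUsd = (pricePerGbUsd * pageSizeMb) / mbPerGb;
  // Whole pages that fit in the budget (integer-friendly form of budget / costPerPage).
  const pages = Math.floor((budgetUsd * mbPerGb) / (pricePerGbUsd * pageSizeMb));
  return { costPerPageUsd, pages };
}

const estimate = pagesPerBudget(20, 4, 20);
console.log(estimate.costPerPageUsd); // 0.08
console.log(estimate.pages); // 250
```

Under these assumptions the 250-page figure in the question is correct, which is why trimming page weight matters so much for the bill.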
@albiceleste101 1 year ago ⁺⁷⁵²
As a freelance dev I get contacted all the time for scraping, it's definitely one of the most requested along with WordPress (which I also don't work with)
@cymaked 1 year ago ⁺⁵⁷
interesting - 8 years of freelancing and never had one such request 😮
@dinoscheidt 1 year ago ⁺¹⁷¹
And with a freelancer, the business has the advantage that YOU break the terms and conditions of the companies you scrape (are legally liable and suable). Not the business 😊 so a cheap code monkey and legal scapegoat all in one 💪
@mrgyani 1 year ago ⁺³
Where do you get these projects from?
@VividCoding 1 year ago ⁺³
@@dinoscheidt Wait can they really do that? They are the ones who wanted to scrape the data in the first place.
@dabbopabblo 1 year ago ⁺⁶⁹
I'm not even a freelancer and I can't count on two hands the number of times I've been asked to make someone a website. They think because I'm a web developer I am just some guy who goes around making websites willy-nilly. And the few times I have actually gone through with helping someone out, they want everything Wix or WordPress provides and have the audacity to suggest I shouldn't be asking so much in pay when a drag-and-drop builder can suffice... THEN USE THE BUILDER GOD DAMMIT. My knowledge is wasted on front end work anyways.
@YuriG03042 1 year ago ⁺⁵⁹
Toward the end of the video, Jeff suggests that you can grab all the links and then make requests to those links. It gave me flashbacks of another video on the main channel where a company did this and ended up with a 70k+ GCP bill after one night of web scraping, because their computing instance was forever recursing and was scalable up to 1000 instances lmao
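One common way to avoid the runaway recursion described above is to crawl with a visited set plus hard limits on page count and depth. The sketch below is illustrative only: `crawl` and `getLinks` are hypothetical names, and `getLinks` stands in for whatever fetches a page and returns its links.

```javascript
// Breadth-first crawl with hard limits so it cannot recurse forever.
// `getLinks` is a hypothetical async function: url -> array of absolute URLs.
async function crawl(startUrl, getLinks, { maxPages = 100, maxDepth = 2 } = {}) {
  const visited = new Set();
  const queue = [{ url: startUrl, depth: 0 }];

  while (queue.length > 0 && visited.size < maxPages) {
    const { url, depth } = queue.shift();
    if (visited.has(url)) continue; // never fetch the same URL twice
    visited.add(url);

    if (depth >= maxDepth) continue; // do not follow links past the depth limit
    for (const link of await getLinks(url)) {
      if (!visited.has(link)) queue.push({ url: link, depth: depth + 1 });
    }
  }
  return [...visited];
}
```

Even with cycles in the link graph, the loop ends once `maxPages` URLs have been visited or the queue is empty, so the worst-case number of requests is known up front.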
@alexcasillas2488 1 year ago ⁺⁴¹
This reminds me of when I solved 100 captchas manually so that I could download some data files from a website for an AI. I got a server message temporarily banning me from the website saying that I must be a bot. I learned my lesson and stuck to only solving 99 captchas each day from then on until I had enough data files
@Autoscraping 3 months ago ⁺⁴
An extraordinary piece of video material that has proven highly useful for our new team members. Your generosity is immensely appreciated!
@EliteGamerpk 1 year ago ⁺¹⁹³
As a web scraping tool developer, one thing to note about the ChatGPT code for extracting product names etc. is that it's not going to work in all cases. What I mean by that is we can see there are some random class names like '._cDEzb'. And these classes can vary from page to page. So your code for one listing page might not work for another. The way I do this is using some advanced query selectors that don't rely on unreliable classes. Can go into more detail if required.
@CrackedPlayz 1 year ago ⁺¹⁴
Please do!
@RiChYFanatics 1 year ago ⁺¹¹
Don't be shy :p
@myhitltd5826 1 year ago ⁺⁵
so that's why I copy the full selector of the element and work with it in puppeteer.
@MrNsaysHi 1 year ago ⁺²
AFAIK puppeteer doesn't support finding elements by xpath, so what do you guys use?
@thrand 1 year ago ⁺²
@@MrNsaysHi well, real men write their own html parser and query language. But peasants like myself use css selectors with document.querySelectorAll.
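As a rough illustration of selectors that avoid generated class names, the sketch below keys on attributes and tag structure and runs through Puppeteer's `page.$$eval`. The markup is an assumption: it supposes each product card carries a `data-asin` attribute and an `<h2>` title, which has not been verified against any live page. `extractListings` is a hypothetical helper name.

```javascript
// Extract listings via attributes and tag structure instead of hashed class names.
// Assumption: each product card has a `data-asin` attribute and an <h2> title.
async function extractListings(page) {
  return page.$$eval('[data-asin]', (cards) =>
    cards
      .map((card) => ({
        id: card.getAttribute('data-asin'),
        // Optional chaining yields null when a card has no <h2>.
        title: card.querySelector('h2')?.textContent.trim() ?? null,
      }))
      .filter((item) => item.id) // skip placeholder cards with an empty id
  );
}
```

The callback passed to `$$eval` runs inside the page, so it must be self-contained and cannot reference variables from the Node.js side.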
@meansnada 1 year ago ⁺¹⁹⁸
I love how there are legit businesses to bypass captchas and mess with data :)
@dislike__button 1 year ago ⁺¹⁶
Scraping isn't illegal
@Tylersmodding 1 year ago ⁺²
and individuals
@aresakmalcus6578 1 year ago ⁺¹
@@dislike__button if it's against Terms of Service of the given site, it is
@Bruceylancer 1 year ago ⁺²⁵
@@aresakmalcus6578 I'm not a lawyer, but how can it possibly be illegal? It can be against ToS, sure, then the website owners can surely act accordingly, i.e. ban your account on the said website, ban your IP address, and so on. But illegal? Are there any laws out there that prohibit collecting public data? Are there any cases of people getting sued for scraping? I haven't heard of such, maybe you can provide some examples. Also, there are 8-figure businesses built on scraping, like Ahrefs or Semrush.
@Bruceylancer 1 year ago ⁺²
@@Andrew-zy7jz Exactly! Very good example.
@Maneki-Nico 1 year ago ⁺⁹
Your videos are somehow exactly relevant to the code I am writing every week - interesting for sure!
@xanderbarkhatov 1 year ago ⁺¹¹⁵
If I'm not mistaken, page.waitForSelector(selector) already returns the element handle, so you don't need to use page.$(selector) after that.
Anyway, great video, as always.
Thank you! ❤
@yvanguemkam4739 1 year ago ⁺⁶
You're right, wanted to say that... But I don't have money to spend on the browser. Is there an alternative?
@cyberzjeh 1 year ago ⁺⁸
@@yvanguemkam4739 you can host puppeteer yourself and pay for a proxy service if you need it, might come out cheaper (but more work obviously)
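The point about `page.waitForSelector` can be shown with a short sketch. `readText` is a hypothetical helper, and the 30-second default is simply an assumed value for illustration.

```javascript
// waitForSelector resolves to the element handle, so a second page.$() lookup is redundant.
async function readText(page, selector, timeoutMs = 30000) {
  const handle = await page.waitForSelector(selector, { timeout: timeoutMs });
  // Run a small function in the page against the element we already hold.
  return handle.evaluate((el) => el.textContent.trim());
}
```

Reusing the returned handle also avoids a small race in which the element could be replaced between the wait and a second query.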
@Loubensdoriscar 3 months ago ⁺¹
Zeus Proxy's specific emphasis on session management is a key factor that resonates with my goal of executing data retrieval tasks with a focus on mimicking genuine user behaviors.
@yashkhd1100 1 year ago ⁺⁴¹
To be frank, out of all YouTubers Fireship has the most interesting and to-the-point videos and gives the most value for the time spent. Kind of just wondering how he keeps track of all the varied topics and is able to make the most out of them.
@julienwickramatunga7338 1 year ago ⁺⁸
He already has five prototypes of Neuralink chips plugged into his brain, linked to the Web via 5G, and he is using digital clones of himself (coded in JS of course) to make more video content (with the help of ChatGPT).
That makes him the most powerful being on the planet.
Praise the Cyber-Jeff! 👾
@RobinhoodCFO 2 months ago
With ChatGPT of course
@AdamBechtol 1 month ago
Mmm
@unknownlordd 1 year ago ⁺⁶⁵
Web scraping is still my favourite type of project. It's so fun and "meaningful" to me, and with the help of AI I can see it becoming much, much easier
@0187 1 year ago ⁺¹⁰
same, gives me shitton of satisfaction
@GeekProdigyGuy 1 year ago ⁺²
thanks Jesus
@alejandroarango8227 1 year ago ⁺¹
Unfortunately GPT4 is still too expensive to use in projects and gpt3.5 is still too stupid.
@unknownlordd 1 year ago ⁺³
@@alejandroarango8227 it's stupid enough so you still do much of the work yourself cause eventually it's just a tool to help and personally it helps me enough
@unknownlordd 1 year ago
@@0187 exactly what i feel
@Jeanseb23 1 year ago ⁺³
You've foiled my plan 5 years in the making. At least now I have a free $10 credit for Brightdata to catch up. Thanks Fireship!
@gatonegro187 1 month ago
how much did u end up spending
@DanielLavedoniodeLima_DLL 1 year ago ⁺¹²
I remember that web scraping was a nightmare to deal with, especially doing this proxy rotation by ourselves. This tool is not cheap, though, so at least here in Brazil (and in other emerging countries), companies will still be doing it like in the old days. The captcha solving was actually done by real people at the time I worked in a company that mined that kind of data a few years ago, but I guess this can be automated with GPT-4 tools now
@abishekbaiju1705 7 months ago ⁺¹
Thanks for making this video. I am actually working on a project where the users can add Amazon products and look for price changes and also get notified of price changes. My objective was to learn web scraping.
@user-bp9dx1ir7w 9 months ago ⁺¹
Thank you for teaching me Puppeteer and Bright Data, beats all the content on the internet
@prabhavkhera4959 1 year ago ⁺¹³
Thanks Jeff. I was planning on building a project that uses web scraping and this video absolutely dropped at the perfect time. Appreciate it. I love your videos and hope for more such content in the future :)
@desertislanddivs 1 year ago
This is a great spell for Hogwarts AI Academy, Thanks Professor Fireship ^^
@nichtolarchotolok 1 year ago
Been using puppeteer for a few yrs for freelance web scraping. Puppeteer and Playwright have been a saving grace in many circumstances.
@donirahmatiana8675 9 months ago
could you give some tips to avoid getting IP banned?
@nichtolarchotolok 9 months ago
@@donirahmatiana8675 puppeteer-extra library and the puppeteer-extra-stealth plugin. If that doesn't work, you'd need a rotating proxy like Bright Data's, as mentioned in the video.
@jacekpaczos3012 4 months ago
@@nichtolarchotolok are you not using scrapy? I always thought of scrapy as the most convenient solution.
@nichtolarchotolok 4 months ago
@@jacekpaczos3012 I started off on the nodejs route and haven't had the need to try the python way of doing this. I do remember trying scrapy in my early days but for some reason puppeteer felt more intuitive to me. That is probably because I felt more comfortable writing javascript code.
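For readers who want to try the stealth setup mentioned in this thread, a minimal sketch follows. It assumes the npm packages `puppeteer`, `puppeteer-extra` and `puppeteer-extra-plugin-stealth` are installed; `launchStealthBrowser` is a hypothetical helper name, and whether the plugin gets past a given site's bot detection is not guaranteed.

```javascript
// Assumption: `puppeteer`, `puppeteer-extra` and `puppeteer-extra-plugin-stealth`
// are installed locally; they are loaded lazily so this file parses without them.
async function launchStealthBrowser() {
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  puppeteer.use(StealthPlugin()); // registers the plugin's evasions before launch
  return puppeteer.launch({ headless: true });
}
```

The returned browser is used like a regular Puppeteer browser, for example `const page = await browser.newPage()`.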
@Ruf4eg 1 year ago ⁺¹
Man, you are reading my thoughts! This video came at the right time when I wanted to scrape some websites!!!!
@shawnvirdree8593 1 year ago ⁺³
Wow, you’re on the cutting edge of technology 🤯
@ikedacripps 1 year ago ⁺⁷
When I first saw puppeteer when I was learning nodejs this is exactly the kind of use case I wanted to apply it to. Specifically wanted to scrape csv files and have some AI learn it and make some sense out of it. I think it’s now more than possible
@DemPilafian 1 year ago ⁺¹²
Downloading CSV files would typically not be considered _"scraping"._ You don't have to scrape the data out of a CSV file -- it's already data.
@ikedacripps 1 year ago
@@DemPilafian you just wanna falsify my statement but scraping for CSV files is as valid as scraping for PDF files. I specifically wanted to scrape soccer analytics websites for those CSV files. Hope that puts it into perspective for you.
@danieldosen5260 1 year ago
I never thought of returning data as JSON... that's obvious and brilliant...
@beefykenny 1 year ago
This video has a lot of value.
@felixmildon690 1 year ago
Best video yet thanks fireship. This will introduce me to puppeteer and the services BrightData offers (BrightData's prices are a concern based on the comments section)
@CODE_YOUR_TYPE 3 months ago
I love you man i was trying for so long and you are the only one who gave the solution thank you so much
@abz4852 1 year ago ⁺²
fireship you are uploading videos faster than new javascript frameworks get released
@calmgee 7 months ago
This was gold
@d3layd 1 year ago ⁺²
Thank you for this! I used ChatGPT to write a puppeteer script for me the other day and it was fucking slick
@KabbalahredemptionBlogspot 8 months ago
OK that was way cooler than I thought
@kinglane8634 1 year ago ⁺⁹
Thanks for always helping us devs keep our workflow clean and simple!!! If you plan on starting a subscription service I'd love to see what you're offering.
@trickster6254 1 year ago ⁺¹
He has got a website offering courses. I bought the Angular one myself and it was really good.
@BharadwajGiridhar 1 year ago ⁺¹⁸
One thing, Jeff: these websites change CSS class names on every refresh. So it's better to write code with selectors that don't change, like id or aria-label.
@kasparsc 1 year ago ⁺¹
Sir, you are a legend 🔥🔥🔥
@pythoneatssquirrel 7 months ago ⁺²
I have built hundreds of scrapers in both VBA and Python using Selenium. Everything can be done; this video is just an ad for one of the hundreds of service providers of this kind.
@Jason-nv6ku 1 year ago
You're amazing! Many thanks!
@blaizeW 1 year ago ⁺¹
Another gold gem for daddy fireship 🤑🔥
@VaibhavShewale 1 year ago ⁺³
damn that was really amazing, i was actually thinking of taking a snippet of the page, extracting the data, then deleting that page and repeating
@rstar899 1 year ago
Amazing video as always 🎉
@wandenreich770 1 year ago ⁺¹
Very insightful
@AbuBakar-pc2fp 1 year ago
Awesome Explanation
@selimachour 1 year ago ⁺³
I usually block the fetching of images, css, fonts (and javascript if the website can run without) which speeds up the page load by a lot!
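Blocking heavy resources as described above can be sketched with Puppeteer's request interception. The list of blocked types mirrors the comment (images, CSS, fonts); `shouldBlock` and `blockHeavyResources` are hypothetical helper names, and some sites may render incorrectly or detect the missing requests.

```javascript
// Resource types to drop, following the comment above (images, CSS, fonts).
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font']);

function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Enable interception on a Puppeteer page and abort the blocked types.
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (shouldBlock(request.resourceType())) {
      request.abort();
    } else {
      request.continue(); // every intercepted request must be aborted or continued
    }
  });
}
```

Call `blockHeavyResources(page)` before `page.goto(...)` so the handler is in place for the first navigation.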
@estebancordoba555 1 year ago ⁺²
In my country, some products are more expensive than on Amazon. I built a scraper to get the products and prices with params such as brand or name, but Amazon blocked me a couple of times. This is a really nice solution!
@danvilela 1 year ago
Brooo, this is awesome!
@sebastianacostamolina9593 8 months ago
really cool
@aseluxestays 9 months ago ⁺¹
I'm here because I need to hire someone who can provide this service for me. Great video!
@TheHassoun9 4 months ago
Hi, I'm willing to help. I'm a dev looking for commissions
@NathanDodson 1 year ago ⁺¹⁵⁰
See. This is why I watch all your videos, Jeff. I'm a super shit JS coder, but I'm pretty decent with Python. This gives me an idea for my own eBay business, and scouring those tool docs for Python SDKs to do the same thing. Honestly, it's been your videos that have kept me in the coding space. You always have these creative "concept/idea" videos and a good majority of them have me opening up VSC to do some tinkering. Thanks for all your content brother.
@priapulida 1 year ago ⁺⁹
there's Pyppeteer
@maskettaman1488 1 year ago ⁺⁹
@BeBop No, it's Pyppeteer
@minhuang8848 1 year ago ⁺²
@@bebop355 *pyppeteer tho
@JGBreton 11 months ago ⁺³
did this materialize?
@tonymudau3005 10 months ago ⁺¹
@@JGBreton lmao 😂 asking myself the same thing
@EuricoAbel 1 month ago ⁺¹
Incorporating Zeus Proxy into your SEO strategy ensures efficient and effective monitoring and data gathering processes.
@hamza-325 1 year ago
I worked for a digital shelf company that scrapes data from Amazon and other websites. They use many proxy services, but one of the most expensive ones was BrightData, so the more experienced workers always instructed us not to use BrightData unless it was really necessary.
@sciencenerd8326 11 months ago
what are the others that are better?
@hamza-325 11 months ago
@@sciencenerd8326 the company has made some cheap proxies using AWS machines, for example (they don't have many IPs but they do the job for many websites). And I think there are cheaper services like ProxyRack.
@fhnvcghj1587 6 months ago
@@hamza-325 I have a Selenium bot task: I have 1000 accounts but need one IP per account to make requests to the website and do the work. Any idea or paid service for that?
@daniamaya 11 months ago
Gold. Just pure gold.
@wlockuz4467 1 year ago ⁺²⁷
Remote browser as a service is actually a genius idea. Often times when you want to scrape at scale the most painful thing to do is hosting and using effective proxies.
But with this you can literally leave the scraper running on your machine and let brightdata take care of the proxies. You don't even need good specs because the browser runs on a different server.
@quickkcare605 1 year ago
Well thought!
@klapaucius515 1 year ago ⁺⁷
smells like ad
@wlockuz4467 1 year ago
@@klapaucius515 Do you mean that for my comment or the video?
@arrvee7249 9 months ago ⁺⁵
ikr, then you can just pay brightdata $10,000 and go on to make $52 for the data you've scraped.
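The remote-browser setup discussed in this thread usually comes down to connecting Puppeteer to a WebSocket endpoint instead of launching Chrome locally. The sketch below is an assumption-heavy illustration: the host, port and credential format are placeholders rather than any provider's documented values, `buildEndpoint` and `connectRemote` are hypothetical helpers, and `puppeteer-core` is assumed to be installed.

```javascript
// Build a WebSocket endpoint with credentials embedded in the URL.
// Assumption: host, port and credential format are placeholders; check the
// provider's dashboard for the real values.
function buildEndpoint(username, password, host, port = 9222) {
  const auth = `${encodeURIComponent(username)}:${encodeURIComponent(password)}`;
  return `wss://${auth}@${host}:${port}`;
}

// Attach to the remote browser instead of launching a local one.
async function connectRemote(endpoint) {
  const puppeteer = require('puppeteer-core'); // assumed installed; no bundled Chrome
  return puppeteer.connect({ browserWSEndpoint: endpoint });
}
```

Because the browser runs elsewhere, the local machine only drives it over the socket, which is why modest hardware is enough.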
@forbiddenera 1 year ago ⁺⁷
Puppeteer is the source of non stop memory leak nightmares for me. Fortunately I got it down to under like 30mb a day but originally it was like 30mb per leak and like 250+mb a day leaked (and it was mostly only loading 2 pages back and forth)
@alejandroarango8227 1 year ago ⁺³
I avoid using it as much as possible; it is a waste of server resources.
@1337Booler 11 months ago
You could just close the browser and open a new one every time you use it to avoid memory leaks
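The "fresh browser per job" suggestion can be wrapped in a small helper so the browser is always closed, even when the job throws. This is a sketch, not a guaranteed cure for the leaks described: `withFreshBrowser` is a hypothetical name and `launch` is whatever function starts a browser, for example `() => puppeteer.launch()`.

```javascript
// Run one task in a brand-new browser and always close it afterwards.
// `launch` is supplied by the caller, e.g. () => puppeteer.launch().
async function withFreshBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close(); // runs even if the task throws
  }
}
```

A typical call would look like `await withFreshBrowser(() => puppeteer.launch(), async (browser) => { /* scrape */ })`, trading some startup time for a clean process each run.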
@Kevgas 1 year ago ⁺³
You should create a course on how to do this, I'd pay for that!
@bossdaily5575 1 year ago ⁺¹⁹
Virgin API users vs Chad Web scrapers
@ehsanpo 1 year ago
web scraping with ruby and rails is one of the best ways
@rid9 1 year ago ⁺¹
This feels like the kind of programming work a ferengi would be involved with.
@garywaddell6309 1 year ago ⁺¹
Brilliant
@KhaledAlMola 9 months ago
That is a cool website to use. I'll try it one day
@robertwitzke6134 11 months ago
great video!
@manfredcomplex366 1 year ago
Freaking Money Glitch. Love you man❤
@chaseclingman 1 year ago ⁺⁶
I liked how you showed the timeout as 2 * 60 * 1000 so beginner friendly haha
@mrgalaxy396 1 year ago ⁺¹⁶
I mean that's way more readable than 120000, this is a pretty common practice
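A small sketch of the pattern under discussion: writing the timeout as a product of its units and applying it with Puppeteer's `page.setDefaultNavigationTimeout`. The constant and function names are made up for illustration.

```javascript
// Spelling the duration out as factors keeps the unit conversion visible.
const NAVIGATION_TIMEOUT_MS = 2 * 60 * 1000; // 2 min * 60 s * 1000 ms = 120000 ms

// Apply it to a Puppeteer page (the `page` object is supplied by the caller).
function applyNavigationTimeout(page) {
  page.setDefaultNavigationTimeout(NAVIGATION_TIMEOUT_MS);
}
```

The factored form makes it easy to spot a misplaced zero, which is the usual failure mode when the number is typed out in full.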
@CandyLemon36 5 months ago
I'm impressed by the depth of this material. A book with corresponding themes was a key influence in my life. "AWS Unleashed: Mastering Amazon Web Services for Software Engineers" by Harrison Quill
@nskiran 1 year ago
We used to use Selenium WebDriver (WebActions) and PhantomJS to scrape data.
IP problems were solved with nohodo.
The good old days, 2014 stack.
@katykarry2495 1 year ago ⁺¹
Can you share the code in the description, for us to test it and edit it to our own needs?
loving your videos!
@forbiddenera 1 year ago
..while Puppeteer can run headless, you don't have to run it headless. It may still seem headless from what most might consider that term to mean but headless or not is a config option for Puppeteer, running with headless disabled can help beat bot detection sometimes.
@luxurycondobbmg 1 year ago ⁺⁴
I remember my first time scraping a website - except back then, we didn't have ChatGPT proompts to do it for us. We had to physically read the documentation and actually understand the code we wrote
@adityag6022 10 months ago
Thank you sir
@gregheth 1 year ago
Wow. Thanks
@rallysahil 2 months ago
Awesome !
@maxivy 1 year ago ⁺²²
Awesome video - I will have to rewrite it in Python though ;) because I am a human bean
@NicolaiWeitkemper 1 year ago ⁺⁴
BeautifulSoup is better anyways :P
@priapulida 1 year ago ⁺¹
@@danielsan901998 or Pyppeteer
@NicolaiWeitkemper 1 year ago
@@danielsan901998 Correct, that's not an even comparison. However: BeautifulSoup >> Cheerio
@MrKrzysiek9991 9 months ago
The Microbots AI Chrome extension helps with building prompts with HTML code included. Check it out if you want to write automation code faster.
@UmanPC 1 year ago ⁺¹
Great!!!
@daniel_q40 1 year ago
Data is the new gold
@v1s1v 7 months ago ⁺¹
Nice tutorial, but there are AI tools now like Kadoa that can do all of this for you. In the time it takes for you to watch this video, you can get an AI scraper up and running.
@JustBR0 1 year ago
Bright data is throwing their money!!
@Victor4X 1 year ago ⁺³⁶
Stuff isn't censored properly at 3:00
But I assume those creds are temporary anyway
@cymaked 1 year ago ⁺³
there are many videos on Fireship where he jokes about living dangerously and letting the creds be seen 😂 obv temp stuff
@thie9781 1 year ago ⁺²
@@cymaked or just F12 to let somebody waste their time
@wesleydunn169 2 months ago ⁺¹
Absolutely fascinated by this video on industrial-scale web scraping with AI and proxy networks! The way Puppeteer integrated with BrightData's scraping browser to extract ecommerce data was impressive. The utilization of ChatGPT for data analysis makes sense as a free and available option. Can't wait to see what Part 2 will unveil!
@avocadodip5740 2 months ago ⁺⁴
Bro is definitely a bot
@kevinbraga9526 10 months ago ⁺¹
Great video, I have a question for you: how do you know that this is the industry standard for modern web scraping?
Like, how can you find out this information?
@kairee1093 11 months ago
thanks
@TheLime1 1 year ago
Good money making right there
@progamer1196 1 year ago
as soon as I saw the thumbnail I knew this was an ad for brightdata
@panther_puneeth 1 year ago
Went over my head, it was so fast
@kevinbatdorf 1 year ago ⁺¹²
some of those query selectors look like they’d break in a week. Maybe you need to add openai to the workflow more directly
@RichardHarlos 1 year ago ⁺³
It's a proof of concept/tutorial, not an explicit recommendation for bulletproof boilerplate. Context, eh? :)
@yellowboat8773 1 year ago
Maybe outputting the html every time to openai, then having that pick the query selector, then inserting it into the script. You do have to be very specific with your prompt because it often replies with: The query selector is: a.carousel
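If a model wraps the selector in a sentence like the one quoted above, a small parser can pull it out before it is passed to Puppeteer. This is a heuristic sketch with a made-up function name; it assumes the selector is either in backticks or follows the first colon-and-space in a single-line reply, and a stricter prompt or a structured output format would be more reliable.

```javascript
// Pull a CSS selector out of a chatty model reply such as
// "The query selector is: a.carousel".
// Assumption: the selector is in backticks, or follows the first ": " in the reply.
function extractSelector(reply) {
  const text = reply.trim();
  const quoted = text.match(/`([^`]+)`/);
  if (quoted) return quoted[1].trim();
  // Match from the first colon followed by whitespace, so selectors that
  // contain their own colons (e.g. li:nth-child(2)) stay intact.
  const labelled = text.match(/:\s+(\S.*)$/);
  const candidate = labelled ? labelled[1] : text;
  return candidate.replace(/\.$/, '').trim(); // drop a trailing full stop
}

console.log(extractSelector('The query selector is: a.carousel')); // a.carousel
```

Validating the result, for example by checking that `page.$(selector)` finds something, is still advisable before relying on it.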
@3rawkz 1 year ago
Scrapy all day baby!
@harisonfekadu 1 year ago
Your ingenuity is something else. It's devs like you that won't be replaced by AI.
@asperthickkgamerr491 1 year ago ⁺¹
cuz he is an AI
@felixmildon690 1 year ago ⁺⁸
Tutorial starts at 2:15
@AnshTiwari-fx2yq 3 months ago ⁺¹
May god bless you
@MalteBohmboehm 1 year ago ⁺¹
For this topic alone it's worth learning Python along with Scrapy
@tw-wp5uv 1 year ago
Bright Data is quite expensive, with average success rates for websites with strong protection measures. Keep that in mind if you want to scale your scraping
@exploringcrypto6609 1 year ago ⁺⁵
Jeff how can you process data so fast?
@SkySesshomaru 1 year ago
o.o that's some impressive shit right there
@aimattant 9 months ago
I have this issue: SyntaxError: Cannot use import statement outside a module
@summonlucifer3603 4 months ago
If you use selenium to open a browser window you can easily scrape from any website
@parlor3115 1 year ago ⁺³
This Bright thingy alleviates most of the overhead of web scraping (IP rotation and captcha solving). And goodness GPT-4. Idk if I should feel happy about the potential prospects or anxious about how crazy good this thing has gotten.
@anderswesterlund4191 1 year ago
true king
@AP-lw4rw 8 months ago
I feel like a gangsta...finding ways around data collection for my business.
@oblivion_2852 1 year ago ⁺¹
Could we have a vid on the difference between Selenium and Puppeteer?
@Xld3beats 1 year ago ⁺⁶
Guess it's time to write a program that applies to every job on the internet
@SteveHazel 1 year ago
pretty good "how to be a hacker" intro heheh.
@Dev-Siri 1 year ago ⁺²
just as I thought the ai videos ended
