These two web scraping vids are awesome! Would love to see one on building a crawler 🕸
What's the difference between scraping and a crawler?
@@internet4543 Did you do something like that? I'm trying to do that
@@hafidooz8034 I think a crawler doesn't get the content, it just hits the URLs; not sure
I would have used the "next" button in the navigation and used its href to get the next page, until there are no more next pages
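Roughly like this, I mean (a rough sketch; the selectors and the URL are just guesses):
```
// Rough sketch: keep following the "next" link until there is none left.
// The start URL and the selectors ('.partner', 'a.next') are assumptions.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  let url = 'https://example.com/partners?page=1';
  const partners = [];

  while (url) {
    await page.goto(url);
    // Collect whatever items live on the current page.
    const names = await page.$$eval('.partner', els => els.map(el => el.textContent.trim()));
    partners.push(...names);
    // Grab the href of the "next" link, or null when there is no such link.
    url = await page.$eval('a.next', el => el.href).catch(() => null);
  }

  console.log(partners);
  await browser.close();
})();
```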
Great idea!
@@OfficialDevTips yeah, that would make sense since a lot of sites have a different strategy for pagination :)
@justvashu how to get that total count?
Thank you!!!
Excellent video that really helped when trying to figure out Puppeteer, and recursion on top of that!
I did find that the count in the recursion didn't like page numbers over 9, so I added these two lines to handle pagination numbers of any length.
```
// number of digits in the current page number (e.g. 2 for page 10)
const digit = currentPageNumber.toString().length;
// strip that many characters off the end, i.e. remove the old page number
const newStreet = street.slice(0, -digit);
```
Thanks again for a well-timed video that saved the day :)
...and you just answered my question from the previous video! Thanks! I enjoyed these two on web scraping so much.
Love this video - learned so much and the guys are entertaining to listen to. Thanks
I'm impressed that you didn't get an error saying 'browser is not defined'!
You mean because it is used in the beginning of the function?
The function is not run until it is called.
At *const partners = extractPartners(firstUrl)*, that’s when we need browser to be defined. And it has been just above.
The code is not run from top to bottom!
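A tiny sketch of the idea; the names mirror the video but the details are simplified:
```
const puppeteer = require('puppeteer');

let browser; // not assigned yet

// Defining the function here is fine even though `browser` is still undefined:
// the body only runs when the function is actually called.
async function extractPartners(url) {
  const page = await browser.newPage();
  await page.goto(url);
  // ...scrape the page and return the results...
  return [];
}

(async () => {
  browser = await puppeteer.launch(); // NOW browser is defined...
  const partners = await extractPartners('https://example.com/partners?page=1'); // ...before this call uses it
  console.log(partners);
  await browser.close();
})();
```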
Yeah! Because my thought was that your extractPartners would need to know what browser is as it is evaluated.
But I'm happy to be proven wrong - it's a great way to learn. I really enjoyed this web scraping series.
Hope the fika was tasty ;)
Nice tutorial and very well explained
Thank you so much David for this amazing scraping video.
This is amazing, thank you so much xxxx
Great tutorial, thank you so much for sharing! I am wondering how to design the function to stop once a certain number of found partners has been reached (e.g. when 50 total partners are found, stop the recursion and proceed to other parts of the code)?
Awesome stuff!
Amazing thanks.
Awesome lesson, really practical
David, great video. As for that h1 tag... they have a history of funny h1 tags on these landing pages. A little over a year ago, before the "360" rebranding changed their marketing site, I was looking at how they formatted their markup for SEO on one of their product pages. I noticed that the h1 tag was in the markup and said, for example, "Google Tag Manager...", but it was not visible to the user. If I remember correctly, on desktop the h1 tag had display:none attached to it. Then, once the hamburger menu breakpoint was crossed, it was still display:none; until you opened the menu, at which point display:none was removed and the h1 tag was wrapped around an img element with an image of the stylized "Google Tag Manager..." The actual text "Google Tag Manager..." in the h1 tag was hidden with CSS and probably used as a fallback. After some research on Matt Cutts' blog, I found out that this is semi-okay to do.
Thank u for this awesome video
Great Vid! You guys should go over Docker next
Subscribed!
Thanks!
David, please bring back the music when you timelapse :) Interested to see where this project is going. Keep it up, always looking forward to the next episode of this series.
Yeah cool! I’ll try doing that more - it just takes time so I try to get something out even though I don’t have the time to add the finishing touches
Would love to see how you would save the data into a JSON or .txt file, or even Firebase
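For the JSON part, I guess something as simple as this would already do it (fs is Node's built-in file system module; the data here is made up):
```
const fs = require('fs');

// Pretend this array came back from the scraper.
const partners = [{ name: 'Example Partner', city: 'Stockholm' }];

// Write it as a pretty-printed JSON file next to the script.
fs.writeFileSync('partners.json', JSON.stringify(partners, null, 2));
```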
This is what a while loop is for
Thanks!
Hmm, silly-questions section here: the first rule of scraping is "be nice", don't overload servers etc. Wouldn't it be nicer if we first copied all the result pages and scraped them locally? What's the general approach?
Could use the HTTP request status code to stop the recursion...
Could probably also create another Puppeteer instance that runs in parallel to check if there is a next page, instead of using the same instance; that would perhaps double the speed.
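A minimal sketch of what I mean with the status code (the URL is just a placeholder):
```
// Sketch: stop paginating once the navigation no longer returns a 2xx response.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const response = await page.goto('https://example.com/partners?page=999');
  if (!response.ok()) {
    // e.g. a 404 past the last page - this is where the recursion would stop
    console.log('No such page, stop here');
  }

  await browser.close();
})();
```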
Regarding the first suggestion the site still returns 200 so that won’t work in this particular case.
If we were to do this on thousands of pages and multiple sites - yes that’s a cool idea. At this stage though I think that is a bit too much overoptimization.
I'd like for you to deploy this (maybe to Firebase Hosting, using a Firebase Cloud Function). You would probably run into an annoying CORS error, so I'd be interested to see how you resolve it. For myself, following the CORS tips in the Firebase Cloud Functions docs doesn't seem to help with web scraping with Puppeteer. :(
Hello, I have some other basic Python web scraping code that saves to a CSV file, so what code would be added here so we can save to a CSV file, please?
Lisa, and thank you
Hi David! In my case the pagination URL has no page parameter. Is there any way to scrape the AJAX response? The required content is loaded via AJAX / client-side.
Why all the regex stuff over just passing the page number as an argument and creating the URL in the method?
Why call a variable X instead of Y? It is just one way of solving it. There are thousands of ways.
Here we were lucky the pattern was so simple, for the next site it may not be. Sure though, it could be done differently, using regex for this exact example was perhaps slightly overengineered.
In programming there is never only one way of solving something. I like not stuffing parameters into the function; I think it looks neat and it is simple to understand what's going on when browsing through the code.
By passing the URL it is simple to scan and understand “aha the function will use that URL and get partners out of it”.
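Roughly the two shapes being compared (both simplified, just to show the difference):
```
// The shape used in the video: the function receives a full URL.
async function extractPartners(url) {
  // ...scrape `url`, derive the next page's URL from it, recurse...
}

// The alternative from the comment: pass a base URL plus a page number.
async function extractPartnersByPage(baseUrl, pageNumber) {
  const url = `${baseUrl}?page=${pageNumber}`;
  // ...scrape `url`, then recurse with pageNumber + 1...
}
```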
I would say "aha next page" but not "aha next url", you took a worse solution and now you have to explain why it is good.
Did the same thing with another website. Everything is the same. But sometimes it returns an empty array [], and sometimes it scrapes only 10 pages even though there are 14 pages. Why is that? I am so tired.
While the cat's away the fika comes out to play.
I have a serious, and only slightly related, question. The truth is I am not a coder; I am renting software via ParseHub. I can use the software just fine, but the website I am scraping, despite having tens of thousands of desired results, has a page limit of 15. There is no way I can get the amount of information I need from such small scrapes. Is there any way to bypass this page limit and gain access to the totality of the actual results, as opposed to the pitiful amount I am actually able to see at this time?
www.bbb.org/search?find_country=USA&find_latlng=41.308563%2C-81.051155&find_loc=Nelson%2C%20OH&find_text=Contractors&page=14&touched=9 This is the kind of result I am talking about.
thanks :like
Hello everyone, I think it does not work anymore. The class "Compact" is no longer there. How do I fix that? I tried with "Landscape" and it returns an empty array either way.
Which text editor are you using?
It's Visual Studio Code.
I think you could simply use a while loop until the function returns an empty array
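Something like this is what I mean (a rough sketch that assumes the video's extractPartners function and a ?page= URL pattern, which are guesses):
```
// Rough sketch of the loop version. It assumes extractPartners(url) from
// the video exists and that pages follow a ?page=N pattern - both guesses.
async function getAllPartners(baseUrl) {
  const all = [];
  let pageNumber = 1;

  while (true) {
    const partners = await extractPartners(`${baseUrl}?page=${pageNumber}`);
    if (partners.length === 0) break; // empty page means we are done
    all.push(...partners);
    pageNumber++;
  }
  return all;
}
```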
I would have just gotten the 404 page, used that to know the number of pages, and fetched them all in parallel. Not really sure why recursion would be needed here.
Hi David. Can you please address the legality around the concept of web scraping? I mean, after I watched your last video with Mattias, I got really excited and did some examples of my own. However, I later found out that web scraping can carry legal consequences if done wrong. So I read the terms of use of a few websites and I found out that web scraping is prohibited in all of them. Can you also advise us on how to use this properly, because we could go to jail out of ignorance? Otherwise, thanks for the videos
It depends heavily on the jurisdiction. In Sweden even personal information like how much you earn, where you live etc. is public, so we are pretty used to that. I'm sure it is more restricted in other countries. As we argued in the previous video, if the content is there in the public domain, it ought to be available to anyone, server or human alike.
Still any publisher is of course allowed to do what they please. If they want to block your IP because you drain their resources or do things they suspect are not OK, they have all rights to do that (I presume!).
Cool. So just to be clear, breaching a company's terms of use will not result in a "cyber-crime" prison sentence or a fraud charge? So it's safe to say the worst that could happen, apart from being sued for copyright infringement if you reuse the content, is getting blocked?
I can't give legal advice. I don't know where you're located. It depends on the jurisdiction.
But definitely you should beware of the terms of use. Often this is common with APIs. Many allow for a lot of fun things... Until you read the terms of use for the API. :(
Swedish sites typically do not have terms of use (I don't know if it is implied through our constitution somehow) so Mattias and I are not very used to that even being an issue.
Oh ok cool... Thanks for the videos!!!
Are you from Eastern Europe?
Sweden
@@ksubota I like your name. The best day of the week 🙂
Hi David, I guess it'd be more interesting and catchy if you added some sound effects to the intro ;)
Can you do a video on how to track a user's (maybe of your own website) exact location using either IP, MAC address or any other way, except the lame geolocation from JavaScript which requires user permission? Please don't just get the address of the server farms, which is as far as I got... Please try to get the user's exact location, like when we use Google Earth, we must see the user's house or office, depending on where they are... Awesome videos, you know I'm a subscriber!!!!
So let me see if I have this correct. You want to be able to see a users exact location (house/office/whatever) without them granting express permission that allows you to do so?
I don't think so.
Oh cool... not for perverted reasons... solely with the intention of improving user experience
lol would you like their passwords and door keys as well?? WTF dude there are reasons this is NOT possible.
Ah, maybe you don't think on my level brah... you can have sensitive information and not misuse it. I do have their passwords if they log into my website, don't be dumb brah
@@GifCoDigital Btw it is possible using the MAC address
Just use a set; each time, you update the set with the pages that are new to the list
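Something like this is what I mean (the names and URLs are made up):
```
// Sketch of the idea: remember which page URLs have already been visited
// so the same page is never scraped twice. Everything here is made up.
const visited = new Set();
const queue = ['https://example.com/partners?page=1'];

while (queue.length > 0) {
  const url = queue.shift();
  if (visited.has(url)) continue; // already scraped, skip it
  visited.add(url);
  // ...scrape `url` and push any newly discovered page URLs onto `queue`...
}
```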
10:23 when your code runs on the first attempt without any error.
Use your regex capture group to get the URL before the page number - not hard-coded
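Something along these lines (the URL is just an example):
```
// Sketch: capture groups let the base URL come from the match itself
// instead of being hard-coded.
const url = 'https://example.com/partners?page=7';
const [, base, pageNumber] = url.match(/(.*page=)(\d+)/);

const nextUrl = `${base}${Number(pageNumber) + 1}`;
console.log(nextUrl); // https://example.com/partners?page=8
```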
Great suggestion!
How so?
The meat of this is something like: _return extractedPartners.length ? extractedPartners.concat(extractPartners(id + 1)) : []_
Wouldn't an infinite loop work that breaks when there is a 404?
Sure it would work. It's the never-ending discussion "what is best, a loop or a recursion". Many argue loops are easier to comprehend and that's why you should use them. But that also has had the effect people rarely use recursion and do not understand it. The purpose of the video is to explain recursion through an example you can relate to - instead of the traditional Fibonacci numbers example. Who uses that in the real world?
JavaScript is a functional language and the recursive approach fits better, in my opinion. Haskell, also a functional language, doesn't even have loops, you have to recurse.
(It is still a 200 OK response in this example. I take it you mean when reaching a page with no items in it.)
DevTips Thanks for the answer. An infinite loop solution would have been a boring video :)
Also I expect to crawl nested structures later on, like category tree structures. Then it would also be more difficult with loops.
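That nested case could end up looking roughly like this (a very rough sketch; the selectors and page structure are made up):
```
// Very rough sketch of crawling a category tree recursively.
// The selectors ('.item', 'a.subcategory') are made up, and a real
// crawler would also track visited URLs to avoid cycles.
async function crawlCategory(page, url) {
  await page.goto(url);

  // Scrape whatever lives on this category page...
  const items = await page.$$eval('.item', els => els.map(el => el.textContent.trim()));

  // ...then find the subcategory links and recurse into each of them.
  const childUrls = await page.$$eval('a.subcategory', els => els.map(el => el.href));
  for (const childUrl of childUrls) {
    items.push(...await crawlCategory(page, childUrl));
  }
  return items;
}
```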
@@OfficialDevTips I had to write a recursive and a non-recursive function that prints a binary tree for a tech interview. Writing that without recursion was surprisingly hard. I always just defaulted to recursion on that one.
Yo, can you guys do some basic Java programs? I'm studying it in school
lol change schools!! Don't waste your time learning that crap. And definitely don't ask a JavaScript-dedicated channel to teach it.
Great videos! Does anybody know a channel that is fun like this one or Fun Fun Function but that uses Python?
If there is no page=1 or page=2,
and pagination is done with javascript:GoPage(2) or javascript:GoPage(3), how can I scrape? (jobs.bdjobs.com/jobsearch.asp?fcatId=1&icatId=)
If anyone has any idea about that, please suggest something. I'm stuck here...
Thanks in advance
Using a regular expression to pull out the page number seems a little odd to me rather than simply passing the function a base URL and the initial page number. Still, a great video.
I'm in Regex Anonymous. I use it to make coffee.
LOL
Kinda just imagining their SEO/Analytics folks freaking out at the web hits all over their site! Just to keep developers on their toes, send over an IE6 web user agent. (I'm kidding, don't do that)
For someone who wants views, you're awfully careless with our time: whistling at the beginning of the video, then going straight to an ad...
Haaahaaa code is ugly