These two web scraping vids are awesome! Would love to see one on building a crawler 🕸
What's the difference between scraping and a crawler?
@@internet4543 Did you do something like that? I'm trying to do that
@@hafidooz8034 I think a crawler doesn't get the content, it just hits the URLs; not sure
I would have used the "next" button in the navigation and used its href to get the next page, until there are no more next pages
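Roughly like this, I mean (a rough sketch; the selectors and the URL are just guesses):
```
// Rough sketch: keep following the "next" link until there is none left.
// The start URL and the selectors ('.partner', 'a.next') are assumptions.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  let url = 'https://example.com/partners?page=1';
  const partners = [];

  while (url) {
    await page.goto(url);
    // Collect whatever items live on the current page.
    const names = await page.$$eval('.partner', els => els.map(el => el.textContent.trim()));
    partners.push(...names);
    // Grab the href of the "next" link, or null when there is no such link.
    url = await page.$eval('a.next', el => el.href).catch(() => null);
  }

  console.log(partners);
  await browser.close();
})();
```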
Great idea!
@@OfficialDevTips yeah, that would make sense since a lot of sites have a different strategy for pagination :)
@justvashu how to get that total count?
Thank you!!!
Excellent video that really helped when trying to figure out Puppeteer, and recursion on top of that!
I did find that the count in the recursion didn't like page numbers over 9, so I added these two lines to handle pagination numbers of any length.
```
// number of digits in the current page number (e.g. 2 for page 10)
const digit = currentPageNumber.toString().length;
// strip that many characters off the end, i.e. remove the old page number
const newStreet = street.slice(0, -digit);
```
Thanks again for a well-timed video that saved the day :)
...and you just answered my question from the previous video! Thanks! I enjoyed these two on web scraping so much.
Love this video - learned so much and the guys are entertaining to listen to. Thanks
I'm impressed that you didn't get an error saying 'browser is not defined'!
You mean because it is used in the beginning of the function?
The function is not run until it is called.
At *const partners = extractPartners(firstUrl)*, that’s when we need browser to be defined. And it has been just above.
The code is not run from top to bottom!
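A tiny sketch of the idea; the names mirror the video but the details are simplified:
```
const puppeteer = require('puppeteer');

let browser; // not assigned yet

// Defining the function here is fine even though `browser` is still undefined:
// the body only runs when the function is actually called.
async function extractPartners(url) {
  const page = await browser.newPage();
  await page.goto(url);
  // ...scrape the page and return the results...
  return [];
}

(async () => {
  browser = await puppeteer.launch(); // NOW browser is defined...
  const partners = await extractPartners('https://example.com/partners?page=1'); // ...before this call uses it
  console.log(partners);
  await browser.close();
})();
```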
Yeah! Because my thought was that your extractPartners would need to know what browser is as it is evaluated.
But I'm happy to be proven wrong - it's a great way to learn. I really enjoyed this web scraping series.
Hope the fika was tasty ;)
Nice tutorial and very well explained
Thank you so much David for this amazing scraping video.
This is amazing, thank you so much xxxx
Great tutorial, thank you so much for sharing! I am wondering how to design the function to stop once a certain number of found partners has been reached (e.g. when 50 total partners are found, stop the recursion and proceed to other parts of the code)?
Awesome stuff!
Amazing thanks.
Awesome lesson, really practical
David, great video. As for that h1 tag... they have a history of funny h1 tags on these landing pages. A little over a year ago, before the "360" rebranding changed their marketing site, I was looking at how they formatted their markup for SEO on one of their product pages. I noticed that the h1 tag was in the markup and said, for example, "Google Tag Manager...", but it was not visible to the user. If I remember correctly, on desktop the h1 tag had display:none attached to it. Then, once the hamburger menu breakpoint was crossed, it was still display:none; until you opened the menu, at which point display:none was removed and the h1 tag was wrapped around an img element with an image of the stylized "Google Tag Manager..." The actual text "Google Tag Manager..." in the h1 tag was hidden with CSS and probably used as a fallback. After some research on Matt Cutts' blog, I found out that this is semi-okay to do.
Thank u for this awesome video
Great Vid! You guys should go over Docker next
Subscribed!
Thanks!
David, please bring back the music when you timelapse :) Interested to see where this project is going. Keep it up, always looking forward to the next episode of this series.
Yeah cool! I’ll try doing that more - it just takes time so I try to get something out even though I don’t have the time to add the finishing touches
Would love to see how you would save the data into a JSON or .txt file, or even Firebase
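For the JSON part, I guess something as simple as this would already do it (fs is Node's built-in file system module; the data here is made up):
```
const fs = require('fs');

// Pretend this array came back from the scraper.
const partners = [{ name: 'Example Partner', city: 'Stockholm' }];

// Write it as a pretty-printed JSON file next to the script.
fs.writeFileSync('partners.json', JSON.stringify(partners, null, 2));
```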
This is what a while loop is for
Thanks!
Hmm, silly-questions section here: the first rule of scraping is "be nice", don't overload servers etc. Wouldn't it be nicer if we first copied all the result pages and scraped them locally? What's the general approach?
Could use the HTTP request status code to stop the recursion...
Could probably also create another Puppeteer instance that runs in parallel to check if there is a next page, instead of using the same instance; that would perhaps double the speed.
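A minimal sketch of what I mean with the status code (the URL is just a placeholder):
```
// Sketch: stop paginating once the navigation no longer returns a 2xx response.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const response = await page.goto('https://example.com/partners?page=999');
  if (!response.ok()) {
    // e.g. a 404 past the last page - this is where the recursion would stop
    console.log('No such page, stop here');
  }

  await browser.close();
})();
```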
Regarding the first suggestion the site still returns 200 so that won’t work in this particular case.
If we were to do this on thousands of pages and multiple sites - yes that’s a cool idea. At this stage though I think that is a bit too much overoptimization.
I'd like for you to deploy this (maybe to Firebase Hosting, using a Firebase Cloud Function). You would probably run into an annoying CORS error, so I'd be interested to see how you resolve it. For myself, following the CORS tips in the Firebase Cloud Functions docs doesn't seem to help with web scraping with Puppeteer. :(
Hello, I have some other basic Python web scraping code that saves to a CSV file, so what code would be added here so we can save to a CSV file, please?
Lisa, and thank you
Hi David! In my case the pagination URL has no page parameter. Is there any way to scrape the AJAX response? The required content is loaded via AJAX / client-side.
Why all the regex stuff over just passing the page number as an argument and creating the URL in the method?
Why call a variable X instead of Y? It is just one way of solving it. There are thousands of ways.
Here we were lucky the pattern was so simple, for the next site it may not be. Sure though, it could be done differently, using regex for this exact example was perhaps slightly overengineered.
In programming there is never only one way of solving something. I like not stuffing parameters into the function; I think it looks neat and it is simple to understand what's going on when browsing through the code.
By passing the URL it is simple to scan and understand “aha the function will use that URL and get partners out of it”.
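Roughly the two shapes being compared (both simplified, just to show the difference):
```
// The shape used in the video: the function receives a full URL.
async function extractPartners(url) {
  // ...scrape `url`, derive the next page's URL from it, recurse...
}

// The alternative from the comment: pass a base URL plus a page number.
async function extractPartnersByPage(baseUrl, pageNumber) {
  const url = `${baseUrl}?page=${pageNumber}`;
  // ...scrape `url`, then recurse with pageNumber + 1...
}
```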
I would say "aha next page" but not "aha next url", you took a worse solution and now you have to explain why it is good.
Did the same thing with another website. Everything is the same. But sometimes it returns an empty array [], and sometimes it scrapes only 10 pages even though there are 14 pages. Why is that? I am so tired.
While the cat's away the fika comes out to play.
I have a serious, and only slightly related, question. The truth is I am not a coder; I am renting software via ParseHub. I can use the software just fine, but the website I am scraping, despite having tens of thousands of desired results, has a page limit of 15. There is no way I can get the amount of information I need from such small scrapes. Is there any way to bypass this page limit and gain access to the totality of the actual results, as opposed to the pitiful amount I am actually able to see at this time?
www.bbb.org/search?find_country=USA&find_latlng=41.308563%2C-81.051155&find_loc=Nelson%2C%20OH&find_text=Contractors&page=14&touched=9 This is the kind of result I am talking about.
thanks :like
Hello everyone, I think it does not work anymore. The class "Compact" is no longer there. How do I fix that? I tried with "Landscape" and it returns an empty array either way.
Which text editor are you using?
It's Visual Studio Code.
I think you could simply use a while loop until the function returns an empty array
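Something like this is what I mean (a rough sketch that assumes the video's extractPartners function and a ?page= URL pattern, which are guesses):
```
// Rough sketch of the loop version. It assumes extractPartners(url) from
// the video exists and that pages follow a ?page=N pattern - both guesses.
async function getAllPartners(baseUrl) {
  const all = [];
  let pageNumber = 1;

  while (true) {
    const partners = await extractPartners(`${baseUrl}?page=${pageNumber}`);
    if (partners.length === 0) break; // empty page means we are done
    all.push(...partners);
    pageNumber++;
  }
  return all;
}
```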
I would have just gotten the 404 page, used that to know the number of pages, and fetched them all in parallel. Not really sure why recursion would be needed here.
Hi David. Can you please address the legality around the concept of web scraping? I mean, after I watched your last video with Mattias, I got really excited and did some examples of my own. However, I later found out that web scraping can carry legal consequences if done wrong. So I read the terms of use of a few websites and I found out that web scraping is prohibited in all of them. Can you also advise us on how to use this properly, because we could go to jail out of ignorance? Otherwise, thanks for the videos
It depends heavily on the jurisdiction. In Sweden even personal information like how much you earn, where you live etc. is public, so we are pretty used to that. I'm sure it is more restricted in other countries. As we argued in the previous video, if the content is there in the public domain, it ought to be available to anyone, server or human alike.
Still any publisher is of course allowed to do what they please. If they want to block your IP because you drain their resources or do things they suspect are not OK, they have all rights to do that (I presume!).
Cool. So just to be clear, breaching a company's terms of use will not result in a "cyber-crime" prison sentence or a fraud charge? So it's safe to say the worst that could happen, apart from being sued for copyright infringement if you reuse the content, is getting blocked?
I can't give legal advice. I don't know where you're located. It depends on the jurisdiction.
But definitely you should beware of the terms of use. Often this is common with APIs. Many allow for a lot of fun things... Until you read the terms of use for the API. :(
Swedish sites typically do not have terms of use (I don't know if it is implied through our constitution somehow) so Mattias and I are not very used to that even being an issue.
Oh ok cool... Thanks for the videos!!!
Are you from Eastern Europe?
Sweden
@@ksubota I like your name. The best day of the week 🙂
Hi David, I guess it'd be more interesting and catchy if you added some sound effects to the intro ;)
Can you do a video on how to track a user's (maybe of your own website) exact location using either IP, MAC address or any other way, except the lame geolocation from JavaScript which requires user permission? Please don't just get the address of the server farms, which is as far as I got... Please try to get the user's exact location, like when we use Google Earth, we must see the user's house or office, depending on where they are... Awesome videos, you know I'm a subscriber!!!!
So let me see if I have this correct. You want to be able to see a users exact location (house/office/whatever) without them granting express permission that allows you to do so?
I don't think so.
Oh cool... not for perverted reasons... solely with the intention of improving user experience
lol would you like their passwords and door keys as well?? WTF dude there are reasons this is NOT possible.
Ah, maybe you don't think on my level brah... you can have sensitive information and not misuse it. I do have their passwords if they log into my website, don't be dumb brah
@@GifCoDigital Btw it is possible using the MAC address
Just use a set; each time, you update the set with the pages that are new to the list
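Something like this is what I mean (the names and URLs are made up):
```
// Sketch of the idea: remember which page URLs have already been visited
// so the same page is never scraped twice. Everything here is made up.
const visited = new Set();
const queue = ['https://example.com/partners?page=1'];

while (queue.length > 0) {
  const url = queue.shift();
  if (visited.has(url)) continue; // already scraped, skip it
  visited.add(url);
  // ...scrape `url` and push any newly discovered page URLs onto `queue`...
}
```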
10:23 when your code runs on the first attempt without any error.
Use your regex capture group to get the URL before the page number - not hard-coded
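Something along these lines (the URL is just an example):
```
// Sketch: capture groups let the base URL come from the match itself
// instead of being hard-coded.
const url = 'https://example.com/partners?page=7';
const [, base, pageNumber] = url.match(/(.*page=)(\d+)/);

const nextUrl = `${base}${Number(pageNumber) + 1}`;
console.log(nextUrl); // https://example.com/partners?page=8
```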
Great suggestion!
How so?
The meat of this is something like: _return extractedPartners.length ? extractedPartners.concat(extractPartners(id + 1)) : []_
Wouldn't an infinite loop work that breaks when there is a 404?
Sure it would work. It's the never-ending discussion "what is best, a loop or a recursion". Many argue loops are easier to comprehend and that's why you should use them. But that also has had the effect people rarely use recursion and do not understand it. The purpose of the video is to explain recursion through an example you can relate to - instead of the traditional Fibonacci numbers example. Who uses that in the real world?
JavaScript is a functional language and the recursive approach fits better, in my opinion. Haskell, also a functional language, doesn't even have loops, you have to recurse.
(It is still a 200 OK response in this example. I take it you mean when reaching a page with no items in it.)
DevTips Thanks for the answer. An infinite loop solution would have been a boring video :)
Also I expect to crawl nested structures later on, like category tree structures. Then it would also be more difficult with loops.
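That nested case could end up looking roughly like this (a very rough sketch; the selectors and page structure are made up):
```
// Very rough sketch of crawling a category tree recursively.
// The selectors ('.item', 'a.subcategory') are made up, and a real
// crawler would also track visited URLs to avoid cycles.
async function crawlCategory(page, url) {
  await page.goto(url);

  // Scrape whatever lives on this category page...
  const items = await page.$$eval('.item', els => els.map(el => el.textContent.trim()));

  // ...then find the subcategory links and recurse into each of them.
  const childUrls = await page.$$eval('a.subcategory', els => els.map(el => el.href));
  for (const childUrl of childUrls) {
    items.push(...await crawlCategory(page, childUrl));
  }
  return items;
}
```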
@@OfficialDevTips I had to write a recursive and a non-recursive function that prints a binary tree for a tech interview. Writing that without recursion was surprisingly hard. I always just defaulted to recursion on that one.
Yo, can you guys do some basic Java programs? I'm studying it in school
lol change schools!! Don't waste your time learning that crap. And definitely don't ask a JavaScript-dedicated channel to teach it.
Great videos! Does anybody know a channel that is fun like this one or Fun Fun Function but that uses Python?
If there is no page=1 or page=2,
and pagination is done with javascript:GoPage(2) or javascript:GoPage(3), how can I scrape? (jobs.bdjobs.com/jobsearch.asp?fcatId=1&icatId=)
If anyone has any idea about that, please suggest something. I'm stuck here...
Thanks in advance
Using a regular expression to pull out the page number seems a little odd to me rather than simply passing the function a base URL and the initial page number. Still, a great video.
I'm in Regex Anonymous. I use it to make coffee.
LOL
Kinda just imagining their SEO/Analytics folks freaking out at the web hits all over their site! Just to keep developers on their toes, send over an IE6 web user agent. (I'm kidding, don't do that)
For someone who wants views, you're awfully careless with our time: whistling at the beginning of the video, then going straight to an ad...
Haaahaaa code is ugly