Pagination is Bad for Scrapy and How to Avoid it

  • Published 5 Nov 2024

COMMENTS • 48

  • @codeRECODE
    @codeRECODE  2 years ago +2

    This video is from my course on Scrapy.
    I edited it for UA-cam, so watching just this one video still makes sense. Hope it's useful :-)

    • @saurabhdakshprajapati1499
      @saurabhdakshprajapati1499 2 years ago

      Sir, what is the name of the library you developed for converting headers to a dict?

  • @teodortodorov787
    @teodortodorov787 2 years ago +1

    Thank you, sir. I had a pagination task at work today and remembered to check this video. The code worked perfectly and the runtime was reduced significantly.

  • @aleksandarboshevski
    @aleksandarboshevski 2 years ago +2

    Thanks for sharing this tip, it's very useful for utilizing the real async power of Scrapy.

  • @aleksandarboshevski
    @aleksandarboshevski 2 years ago +3

    To develop the right mindset for fully utilizing Scrapy's async capabilities, this method should probably be taught as the default from the beginning, because once you form certain coding/thinking habits, it is much harder to change that way of thinking later.

    • @codeRECODE
      @codeRECODE  2 years ago +2

      You are right. I will update my upcoming course to move this video from the advanced techniques section to the pagination section. :-)

  • @DittoRahmat
    @DittoRahmat 2 years ago +1

    Wow, I didn't know about that. I always thought that the next_page approach was best practice. I didn't realize there was a better one.

    • @codeRECODE
      @codeRECODE  2 years ago +1

      Glad that it was useful :-)

  • @carloscampos7709
    @carloscampos7709 1 year ago

    Nice way to look at it! Thanks for the video

  • @subrinalazad3215
    @subrinalazad3215 2 years ago +1

    Thank you for making a video on this great topic!

    • @codeRECODE
      @codeRECODE  2 years ago

      My pleasure! It genuinely makes me happy if a video is helpful 🙂

  • @tirullow6313
    @tirullow6313 2 years ago

    Hi, thank you for the video. In the Amazon scraper the range goes from 2 to int(total_pages)+1; shouldn't it go from current_page to int(total_pages)+1? Thanks.
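
For context, a rough sketch of the pattern being asked about (the URL and selectors below are hypothetical, not the actual Amazon scraper from the course): page 1 is requested first, its items are parsed, and the total page count is read from that same response, so the remaining pages are queued starting at 2 because page 1 has already been handled.

    import scrapy

    class ListingSpider(scrapy.Spider):
        name = "listing"
        # Hypothetical listing URL; the real URL and selectors will differ per site
        base_url = "https://example.com/search?page={}"

        def start_requests(self):
            # Page 1 gets a dedicated callback that also discovers the page count
            yield scrapy.Request(self.base_url.format(1), callback=self.parse_first)

        def parse_first(self, response):
            # Page 1's items are parsed here, which is why the loop below starts at 2
            yield from self.parse(response)
            total_pages = response.css("span.total-pages::text").get()
            if total_pages:
                # Queue pages 2..total in one go instead of chasing a next link
                for page in range(2, int(total_pages.strip()) + 1):
                    yield scrapy.Request(self.base_url.format(page), callback=self.parse)

        def parse(self, response):
            for item in response.css("div.result"):
                yield {"title": item.css("h2::text").get()}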

  • @hungduy3152
    @hungduy3152 2 years ago

    What an interesting video. I have some questions:
    1. I'm just a beginner at web scraping. Can you show how we can scrape in the cloud for free? I have seen some videos, but they are all paid.
    2. Can you make a video on how to scrape from a Jupyter notebook?

    • @codeRECODE
      @codeRECODE  2 years ago +2

      1. As a beginner, you can play around with www.zyte.com/scrapy-cloud/
      However, I would like to challenge you with a different thought process. This thought process changed my earnings directly. Think of all the learning and freelancing work as a business. If you are in any kind of business, you invest.
      Similarly, start thinking about learning as investing. Learning will bring you money. If you want to earn good money, you need to invest. The moment you start to pay for things that can actually bring you money later, you will have access to far better resources.
      You will learn a lot and earn a lot.
      Good luck!
      2. You can use tools like beautifulsoup4 and selenium in Jupyter notebooks. You cannot use Scrapy with them. Even though I experiment with a lot of tools for web scraping, I always come back to Scrapy. It is simply wonderful!

  • @diegovargas3853
    @diegovargas3853 2 years ago +1

    Great idea, but what about CrawlSpider?

    • @codeRECODE
      @codeRECODE  2 years ago +1

      Crawl Spiders have rules to extract links, so that is already taken care of.
      The idea is to send requests in bulk rather than sending them one by one.
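
For reference, a minimal CrawlSpider sketch (the site URL and selectors are hypothetical): the rules themselves discover and follow pagination and item links, so there is no manual next-page chaining to rewrite.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class CatalogueSpider(CrawlSpider):
        name = "catalogue"
        start_urls = ["https://example.com/catalogue/"]  # hypothetical

        rules = (
            # Follow every pagination link the extractor finds
            Rule(LinkExtractor(restrict_css="ul.pager"), follow=True),
            # Send each product detail link to parse_item
            Rule(LinkExtractor(restrict_css="article.product h3"), callback="parse_item"),
        )

        def parse_item(self, response):
            yield {"title": response.css("h1::text").get()}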

  • @jagdish1o1
    @jagdish1o1 2 years ago +1

    I always use this approach when it's available... but I didn't know it actually speeds up performance 😂 Thanks for the explanation ❤️ Love your content.

  • @HP-st6ff
    @HP-st6ff 2 years ago +1

    Hi! I'm having problems scraping a page. I extract a different number of items each time I run the spider. When I use pagination, it loses fewer items. Do you know what the reasons for this could be? Thank you in advance!

    • @codeRECODE
      @codeRECODE  2 years ago

      Hard to say without looking at the site.
      Some sites render static and dynamic content based on certain triggers. Amazon is one such example. It could be the same issue.

  • @vikasunnikkannan
    @vikasunnikkannan 1 year ago

    Hi, what if we don't know the page number? Would it make sense to scan until the next button is no longer present in a page's HTML and then do pagination on the number of pages available?

    • @codeRECODE
      @codeRECODE  1 year ago +1

      If you are visiting the page anyway, then parse the data as well; why visit the same page twice? This approach is more suitable when you can determine the pages without actually visiting them.
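
A small sketch of the fallback described above, for when the total page count cannot be determined up front (selectors are hypothetical): every visited page is parsed for data, and the next link is followed only while it exists.

    import scrapy

    class UnknownPagesSpider(scrapy.Spider):
        name = "unknown_pages"
        start_urls = ["https://example.com/search?page=1"]  # hypothetical

        def parse(self, response):
            # Parse the data on every page that is visited
            for item in response.css("div.result"):
                yield {"title": item.css("h2::text").get()}

            # Follow the next button only while it is present on the page
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)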

  • @valostudent6074
    @valostudent6074 2 years ago

    very insightful sir

  • @syedghouse1696
    @syedghouse1696 2 years ago

    Hi,
    While scraping, I am getting 429 Too Many Requests errors in Scrapy.
    Can you please advise on how to solve this?
    If possible, a video on it would be great.

    • @codeRECODE
      @codeRECODE  2 years ago

      Looks like rate limiting. Use proxies.
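
For reference, a settings sketch that often helps with 429 responses before reaching for proxies (the values below are illustrative, not recommendations): slow the crawl down, let AutoThrottle adapt the request rate, and retry throttled responses; a proxy can then be attached per request via the meta key that Scrapy's HttpProxyMiddleware reads.

    # settings.py (illustrative values)
    CONCURRENT_REQUESTS = 4
    DOWNLOAD_DELAY = 1.0
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
    RETRY_ENABLED = True
    RETRY_HTTP_CODES = [429, 500, 502, 503]
    RETRY_TIMES = 3

    # In a spider, a proxy can be set per request, e.g.:
    # yield scrapy.Request(url, meta={"proxy": "http://user:pass@proxyhost:8080"})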

  • @mariogreco729
    @mariogreco729 2 years ago

    Amazing, thank you good sir

  • @ayoubsarab5193
    @ayoubsarab5193 1 year ago

    Which method is better, this one or LinkExtractor?

    • @codeRECODE
      @codeRECODE  3 months ago

      It depends on the site and your use case. There is no general answer.

  • @sobinchacko8710
    @sobinchacko8710 5 months ago

    Hello sir, I have a doubt: if the site has a growing list, how do we avoid duplicates?

    • @codeRECODE
      @codeRECODE  5 months ago +1

      Watch the video on infinite scroll

    • @codeRECODE
      @codeRECODE  5 months ago +1

      Scrape Infinite Scroll Pages with Python Scrapy
      ua-cam.com/video/aPpKjhP1r58/v-deo.html

  • @Scuurpro
    @Scuurpro 2 years ago

    I think I figured out my issue. How do you do this for crawl spiders? I'm new. Nvm. If you get a chance, could you do pagination for crawlers on websites that use JavaScript?

  • @souleymen971
    @souleymen971 4 months ago

    Very helpful, thank you very much.

  • @DeepDeepEast
    @DeepDeepEast 2 years ago

    But even when the parse method uses recursion, the Scrapy scheduler works asynchronously. It's still nice to iterate the pages.

    • @codeRECODE
      @codeRECODE  2 years ago

      I am not sure I understand your question correctly. Anyway, the point is that the parse method is a callback, so it is triggered when the response is available. If you send 100 requests, the parse method will be called 100 times, once for each response.
      So it's better not to wait for a response but to send more and more requests together.

    • @DeepDeepEast
      @DeepDeepEast 2 years ago +1

      @@codeRECODE My point is that Scrapy doesn't wait for a response to start a new request, even when using pagination. When I use this pagination concept and log the requests, I can see that Scrapy sends the requests before processing the previous ones.

    • @codeRECODE
      @codeRECODE  2 years ago

      @@DeepDeepEast Cool! So you know what you are doing. That's the whole objective :-)

    • @DeepDeepEast
      @DeepDeepEast 2 years ago

      @@codeRECODE I don't know, I just wanted to be sure I'm right.

  • @ahmedellban5748
    @ahmedellban5748 2 years ago

    Thanks

  • @haideralihassan5053
    @haideralihassan5053 2 years ago

    That's a good idea.

    • @codeRECODE
      @codeRECODE  2 years ago

      Oh yes, it is a good one. It took me quite a lot of thinking about how to explain it as simply as possible. :-)

  • @pythonically
    @pythonically 2 years ago

    Why am I unable to do this on this website?
    error: url is not defined

    import scrapy

    class ThriftSpider(scrapy.Spider):
        name = 'thrift'
        allowed_domains = ['www.thriftbooks.com']
        url = 'www.thriftbooks.com/browse/?b.search=comic#b.s=mostPopular-desc&b.p={}&b.pp=30&b.tile'

        def start_requests(self):
            for i in range(1,15):
                yield scrapy.Request(url.format(i))

        def parse(self, response):
            for s in response.xpath("//div[@class='AllEditionsItem-tile Recipe-default']"):
                title = s.xpath("//div[@class='AllEditionsItem-tileTitle']/a/text()").get()

                yield {
                    'title' : title
                }
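
For reference, a likely fix for the snippet above (assuming the site's markup is as shown): url is a class attribute, so inside start_requests it must be referenced as self.url, and Scrapy needs an absolute URL with an https:// scheme. Note also that everything after the # fragment is never sent to the server, so the page parameter may need to move into the query string for pagination to actually take effect.

    import scrapy

    class ThriftSpider(scrapy.Spider):
        name = 'thrift'
        allowed_domains = ['www.thriftbooks.com']
        # A scheme is required, otherwise Scrapy rejects the URL;
        # parameters after '#' are not sent to the server
        url = 'https://www.thriftbooks.com/browse/?b.search=comic#b.s=mostPopular-desc&b.p={}&b.pp=30&b.tile'

        def start_requests(self):
            for i in range(1, 15):
                # url is a class attribute, so it needs the self. prefix
                yield scrapy.Request(self.url.format(i))

        def parse(self, response):
            for s in response.xpath("//div[@class='AllEditionsItem-tile Recipe-default']"):
                # './/' keeps the lookup relative to the current tile
                title = s.xpath(".//div[@class='AllEditionsItem-tileTitle']/a/text()").get()
                yield {'title': title}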