
How I Scrape multiple pages on Amazon with Python, Requests & BeautifulSoup

  • Published 21 Nov 2020
  • In this video I demonstrate one way to deal with pagination when scraping the Amazon website. We check whether the next button is available, collect the URL from it, and use our functions to move on and scrape the next page. This works well because we can let it run and collect all the pages without having to add a number to the URL each time. The method also works for other websites with a similar style of pagination
    code: github.com/jhn...
    Digital Ocean (Affiliate Link) - m.do.co/c/c7c9...
    -------------------------------------
    Disclaimer: These are affiliate links and as an Amazon Associate I earn from qualifying purchases
    -------------------------------------
    Sound like me:
    microphone amzn.to/36TbaAW
    mic arm amzn.to/33NJI5v
    audio interface amzn.to/2FlnfU0
    -------------------------------------
    Video like me:
    webcam amzn.to/2SJHopS
    camera amzn.to/3iVIJol
    lights amzn.to/2GN7INg
    -------------------------------------
    PC Stuff:
    case: amzn.to/3dEz6Jw
    psu: amzn.to/3kc7SfB
    cpu: amzn.to/2ILxGSh
    mobo: amzn.to/3lWmxw4
    ram: amzn.to/31muxPc
    gfx card amzn.to/2SKYraW
    27" monitor amzn.to/2GAH4r9
    24" monitor (vertical) amzn.to/3jIFamt
    dual monitor arm amzn.to/3lyFS6s
    mouse amzn.to/2SH1ssK
    keyboard amzn.to/2SKrjQA
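The pagination approach described above can be sketched end to end. This is a hedged sketch, not the exact code from the repo: `get_data` here is a stand-in for the video's `getdata()` (the real one fetches with requests-html and renders JavaScript), and the hard-coded HTML snippets simulate Amazon's `a-pagination` markup so the loop logic can run offline:

```python
from bs4 import BeautifulSoup

# Simulated pages standing in for live Amazon results (page 3 is the last
# one, so its "next" item carries the a-disabled class).
PAGES = {
    '/s?page=1': '<ul class="a-pagination"><li class="a-last"><a href="/s?page=2">Next</a></li></ul>',
    '/s?page=2': '<ul class="a-pagination"><li class="a-last"><a href="/s?page=3">Next</a></li></ul>',
    '/s?page=3': '<ul class="a-pagination"><li class="a-disabled a-last">Next</li></ul>',
}

def get_data(url):
    # Stand-in for the video's getdata(): a real version would do
    # r = s.get(url); r.html.render(sleep=1) and parse r.html.html.
    return BeautifulSoup(PAGES[url], 'html.parser')

def get_next_page(soup):
    # If the "next" button is not disabled, return its href; otherwise stop.
    pages = soup.find('ul', {'class': 'a-pagination'})
    if not pages.find('li', {'class': 'a-disabled a-last'}):
        return pages.find('li', {'class': 'a-last'}).find('a')['href']
    return None

url = '/s?page=1'
visited = []
while url:
    soup = get_data(url)
    visited.append(url)
    url = get_next_page(soup)

print(visited)  # ['/s?page=1', '/s?page=2', '/s?page=3']
```

The loop ends on its own when the disabled "next" item appears, which is why no page counter is needed.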

COMMENTS • 126

  • @JohnWatsonRooney
    @JohnWatsonRooney  3 years ago +10

    UPDATE: check the repo for a short code tweak - github.com/jhnwr/amazon-pagination
    def getdata(url):
        r = s.get(url)
        r.html.render(sleep=1)
        soup = BeautifulSoup(r.html.html, 'html.parser')
        return soup

    • @KhalilYasser
      @KhalilYasser 3 years ago +1

      Amazing. Thanks a lot for your support.

    • @axvex595
      @axvex595 3 years ago +1

      I tried this script as well, no luck...

    • @axvex595
      @axvex595 3 years ago +1

      This is the error I'm getting:
      "The application has failed to start because its side-by-side configuration is incorrect. Please see the application event log or use the command-line sxstrace.exe tool for more detail"
      Any ideas!?

    • @genedummac
      @genedummac 3 years ago +2

      Great tutorial bro! Please tell me what VS Code theme you are using in this video, I like it. Thanks

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +2

      @@genedummac sure, it's called One Dark Pro

  • @vahsek7488
    @vahsek7488 1 year ago +2

    Best and simplest way of scraping Amazon products, hats off to you. Lots of love from India

  • @proxyscrape
    @proxyscrape 1 year ago +4

    Amazing tutorial John! I love how you break down the process of pagination. Keep up the great work :)

  • @pradeepkumar-qo8lu
    @pradeepkumar-qo8lu 3 years ago +6

    This method is more intuitive than concatenating the page numbers.
    Thanks for the useful content 👍

  • @michaeltillcock3864
    @michaeltillcock3864 1 year ago +4

    Subscribed, really well explained. I would love to see a video showing web scraping of wikipedia tables, with a loop that can input different wikipedia URLs based on URLs stored in an Excel file. I can't find a video on this and I think it would be very popular!

  • @leleemagnu6831
    @leleemagnu6831 3 years ago +2

    Great videos John! I could not be more grateful. Thank you!
    A suggestion for a more concise ending:
    while url:
        data = get_data(url)
        url = get_next_page(data)
        print(url)
    print("That's all folks, THE END")

  • @3wXpertz
    @3wXpertz 1 year ago +1

    I have a website where the next page doesn't show any link; it just shows # at the end of the URL. Every page I move to, the URL doesn't change - it just shows # at the end for every page number I hover over. How do I get the URL for each individual page?

  • @jonathanfriz4410
    @jonathanfriz4410 3 years ago +1

    Very nice John, I always learn something new with your videos.

  • @JayBeeDev
    @JayBeeDev 3 years ago +1

    You are a hero John ❤️! Love what you do!

  • @eddiethinhvuong1607
    @eddiethinhvuong1607 3 years ago +7

    Hi John, thanks for the video - very intuitive and easy to understand for a newbie. I've always used Selenium for web scraping and have been learning a bunch from your videos :)
    I've got a question I hope you can answer: what would you do if the response is a captcha instead of the actual site when sending a request? I have been trying to find a way to get through it but found none. Thank you!

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +3

      Thanks! Captchas are a bit more of a challenge, have a look at some captcha solving services and see how they work, you will get a better understanding that way!

  • @ammaralzhrani6329
    @ammaralzhrani6329 3 years ago +1

    Please keep going! Your channel will grow, I promise.

  • @samvid1992
    @samvid1992 3 years ago +1

    Thank you very much. This is exactly what I was looking for.

  • @Mangosrllysuck
    @Mangosrllysuck 3 years ago +1

    Great content! Liked and subbed. Thanks for doing this

  • @shahalmoveed6191
    @shahalmoveed6191 3 years ago +2

    Thank you sir for your brilliant explanation 💯

  • @raywong9832
    @raywong9832 1 year ago +1

    Hi, thanks for the nice video. I ran into a page that navigates via a dropdown box. I was able to scrape all the option values from the dropdown, but I don't have any idea how to navigate to a page from the options. Do you have any existing videos or documentation I could reference?

  • @justins7796
    @justins7796 3 years ago +2

    A++ videos man, you'll be big in no time :D

  • @mohammedthanveerulaklam9288

    The code runs smoothly on the first iteration, but when it moves to the second iteration of the loop it fails with the following error:
    if not page.find('li', {'class': 'a-disabled a-last'}):
    AttributeError: 'NoneType' object has no attribute 'find'
    I don't know the solution, please help....☹
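This AttributeError, which several commenters report, means `soup.find('ul', {'class': 'a-pagination'})` returned None - usually because Amazon served a block or captcha page instead of results. A hedged, defensive variant of the video's next-page helper (the None guards are my addition; the class names are the ones used in the video):

```python
from bs4 import BeautifulSoup

def get_next_page(soup, base='https://www.amazon.co.uk'):
    pages = soup.find('ul', {'class': 'a-pagination'})
    if pages is None:
        # No pagination list: the page likely didn't render, or Amazon
        # returned a captcha/block page. Stop instead of crashing.
        return None
    if pages.find('li', {'class': 'a-disabled a-last'}):
        return None  # "next" is disabled: we are on the last page
    nxt = pages.find('li', {'class': 'a-last'})
    if nxt is None or nxt.find('a') is None:
        return None
    return base + nxt.find('a')['href']

# A block/captcha page has no pagination, so this now returns None cleanly:
blocked = BeautifulSoup('<p>Enter the characters you see below</p>', 'html.parser')
print(get_next_page(blocked))  # None
```

Returning None also ends a `while url:` loop cleanly, so the script stops instead of raising.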

  • @kamaleshpramanik7645
    @kamaleshpramanik7645 3 years ago +1

    That is a very helpful video. Thank you very much, Sir.

  • @adnanklc1527
    @adnanklc1527 3 years ago +2

    This is very nice content, thanks for that. Can we get a full video where we pull user comments from a dynamic page (for example, from ten pages) and add them to a list?

  • @anilfirat7651
    @anilfirat7651 3 years ago +1

    Hi John, another nice piece of content! Thx a lot!
    Can you do a video on a price tracker that includes the price inside the buybox, shipping price, fees etc. based on delivery location?
    Because this information may change based on location. Hope you see this comment.
    Keep rocking mate!

  • @mikkiverma9545
    @mikkiverma9545 3 years ago +1

    Thanks John, it really helped.

  • @erkindalkilic
    @erkindalkilic 3 years ago +1

    Thank you very much bro. The information you have given is amazing.

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +1

      You are most welcome

    • @erkindalkilic
      @erkindalkilic 2 years ago

      @@JohnWatsonRooney Sir, how can I reach you? Twitter? Any email? Social media platform?

  • @prateeksharma-ig5qg
    @prateeksharma-ig5qg 2 years ago

    What can we do if the URL doesn't contain a page number?
    Please help..

  • @Magma-uw7yo
    @Magma-uw7yo 6 months ago

    Is it possible to get the content with a loop if the URL doesn't change? When I click on the button, the content changes but not the URL.

  • @ikalangitahaja
    @ikalangitahaja 2 years ago

    Great, but it doesn't work if the product list has no pagination; just check whether the pages element exists to solve it.

  • @BotanicalOdyssey
    @BotanicalOdyssey 3 years ago +1

    These are so great John thank you for posting!

  • @yogeshkumarshankariya642
    @yogeshkumarshankariya642 2 years ago +1

    Hi John, what can I do when the next-page URL postfix looks like 'MTYzMjMwMzE3NDAwMHw2MTRhZjg0NjYyMjNlMjIxMThiNzYxODY' instead of a number? Also, while scraping, the URL I get in Python differs from what the inspect page shows.

  • @irfankalam509
    @irfankalam509 3 years ago +1

    Very useful one! Keep Going!

  • @pietravalle69
    @pietravalle69 1 year ago

    I'm new to Python; I get this error: ModuleNotFoundError: No module named 'requests_html'

  • @samithagoud189
    @samithagoud189 1 year ago

    How do I web-scrape recent job openings posted 24 hours ago?

  • @Adaeze_ifeanyi
    @Adaeze_ifeanyi 1 year ago

    Am I the only one confused by this web scraping method? I have watched tons of your videos and I am still getting errors. I need help - I can't seem to load all the pages, and I have tried all the methods in your videos.

  • @manny7662
    @manny7662 3 months ago +1

    Would you recommend web scraping or using an official API?

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 months ago +1

      Official API if you can. Structured data and no breaking changes without warning (hopefully)

  • @ismailsufiani2810
    @ismailsufiani2810 1 year ago +2

    appreciated

  • @jayp9148
    @jayp9148 2 years ago

    Hey John, I'm getting this error:
    if not page.find('span', {'class', 's-pagination-item s-pagination-disabled'}):
    AttributeError: 'NoneType' has no attribute 'find'

  • @nikhildoye9671
    @nikhildoye9671 1 year ago

    Hi John, On Zomato - Food delivery app, next button gets hidden once we reach the final page. How should I proceed?

  • @cm4u825
    @cm4u825 3 years ago +1

    Hi, I am having this error - what do I do now?
    RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago

      Hi! This is because you are using Jupyter notebooks or similar; it causes issues - this code needs to be run as a .py file using VS Code or another editor

    • @cm4u825
      @cm4u825 3 years ago

      @@JohnWatsonRooney Thanks for the reply.
      Will this work fine in Spyder or another editor? Please share other editor names.

  • @Gh0stwrter
    @Gh0stwrter 3 years ago +1

    Great video dude

  • @Dr_Knight
    @Dr_Knight 2 years ago

    Thanks for this video! Is it possible to parse data if there is a button which loads more data on the same page using Beautiful soup?

  • @srivathsgondi191
    @srivathsgondi191 1 year ago +1

    Hi John, I've tried to scrape the Amazon website, but their anti-bot measures keep blocking my requests. My status_code is always 503. How do I fix this?

    • @JohnWatsonRooney
      @JohnWatsonRooney  1 year ago

      Have you tried the correct user agent? Copy some of the headers from when you load the page up in Chrome too, that should help!
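To expand on the header advice above: a sketch of attaching browser-like headers to a requests session. The user-agent string below is only an example; copy the real one from your own browser's network tab:

```python
import requests

# Example browser-like headers; the exact user-agent string is illustrative,
# not a value prescribed by the video.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0.0.0 Safari/537.36'),
    'Accept-Language': 'en-GB,en;q=0.9',
}

s = requests.Session()
s.headers.update(HEADERS)  # every s.get() made with this session sends these
# r = s.get('https://www.amazon.co.uk/s?k=laptops')  # then check r.status_code
```

Setting headers on the session (rather than per request) keeps every call in the pagination loop consistent.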

    • @srivathsgondi191
      @srivathsgondi191 1 year ago

      @@JohnWatsonRooney I fixed the issue by using proxies to generate requests.

  • @nidhikaushik2861
    @nidhikaushik2861 2 years ago

    Hi John, thanks for the amazing content - I have learnt a lot from your videos. I have a question: while printing out the soup it gives me a 503 Service Unavailable error. How do I deal with that?🙄

  • @SajjadKhan-cn6mv
    @SajjadKhan-cn6mv 2 years ago

    if not pages.find('li', {'class': 'a-disabled a-last'}):
    AttributeError: 'NoneType' object has no attribute 'find'
    Running the exact code... pages is NoneType

  • @TechsAndSpecs
    @TechsAndSpecs 3 years ago

    Thanks for this video. How can we limit the search to only the first 10 pages?
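One way to answer the question above: cap the pagination loop with a counter. A sketch with stand-in helpers (`get_data`/`get_next_page` mirror the video's functions in name only; here they are faked so the loop can run offline):

```python
MAX_PAGES = 10

def get_data(url):
    # Stand-in for the real fetch-and-parse helper from the video.
    return {'url': url}

def get_next_page(data):
    # Stand-in: pretend every page links to the next one, endlessly.
    n = int(data['url'].rsplit('=', 1)[-1])
    return f'/s?page={n + 1}'

url = '/s?page=1'
scraped = []
for _ in range(MAX_PAGES):          # visit at most 10 pages
    data = get_data(url)
    scraped.append(url)
    url = get_next_page(data)
    if not url:                     # stop early if there is no next page
        break

print(len(scraped))  # 10
```

Swapping the video's `while url:` for a bounded `for` loop keeps the early-exit behaviour while adding a hard ceiling.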

  • @alessandrowind4544
    @alessandrowind4544 3 years ago

    Hello, I updated the script and tried to solve this, but I still get this error:
    Traceback (most recent call last):
      File "C:\Users\Mark\Desktop\script.py", line 76, in <module>
        url = getnextpage(data)
      File "C:\Users\Mark\Desktop\script.py", line 67, in getnextpage
        if not pages.find('li', {'class': 'a-disabled a-last'}):
    AttributeError: 'NoneType' object has no attribute 'find'

    • @alessandrowind4544
      @alessandrowind4544 3 years ago

      In the getdata(url) method, of course, I do other operations like printing out some data

  • @dongmogilles3209
    @dongmogilles3209 2 years ago

    I wish to scrape over ten pages - how can I use a for loop in your code? Thanks

  • @imranullah7355
    @imranullah7355 2 years ago

    Sir, I get an empty list from soup.find_all("div", class_="some class"), although there are some children of this class.
    What can be the reason?

  • @faker_fakerplaymaker3614
    @faker_fakerplaymaker3614 2 years ago

    Didn't work for me... the HTML was different. It's always different from the tutorials, so I never know how to access the tags.

  • @brokerkamil5773
    @brokerkamil5773 11 months ago

    Thx John ❤❤

  • @quangjoseph8287
    @quangjoseph8287 2 years ago

    Hi bro, I'm stuck on a problem with Web Worker APIs when scraping websites. The website always sends a preflight request before it sends the main request. Could you please make a video about it?

  • @ansadhedhi2469
    @ansadhedhi2469 2 years ago +1

    I am unable to import requests_html. what could be the issue?

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago

      Make sure you install it with pip first - pip install requests-html

  • @rajatchauhan6675
    @rajatchauhan6675 3 years ago

    What method can be used to scrape "load more" data?

  • @stalluri11
    @stalluri11 3 years ago

    Hi John, how do we scrape data when url doesn't change for next page?

  • @danielcanizalez8558
    @danielcanizalez8558 2 years ago

    Great tutorial, thanks!!!!
    I need to process around 100 URLs and it returns a 503 :( Any help?

  • @technoscopy
    @technoscopy 2 years ago

    Hello, my submit data is in a JSF viewstate - how do I scrape that? Please help me 😭😭

  • @python689
    @python689 1 year ago

    Hello, help me please: how do I get the text "Wilson Tour Premier All Court 4B" out of this?
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find('h1', class_='product--title')
    Tennis balls Wilson Tour Premier All Court 4B
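If the product name sits in a nested element inside that `h1`, you can drill one level deeper. This sketch assumes markup like the below (the nested `span` is a guess about that site's structure, not something confirmed in the comment):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the brand prefix as bare text, the product name in a span.
html = ('<h1 class="product--title">Tennis balls '
        '<span>Wilson Tour Premier All Court 4B</span></h1>')
soup = BeautifulSoup(html, 'html.parser')

title = soup.find('h1', class_='product--title')
print(title.get_text(' ', strip=True))          # full heading text
print(title.find('span').get_text(strip=True))  # just the product name
```

If the heading really is one flat string, the fallback is plain string work on `title.get_text()`, e.g. stripping a known prefix.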

  • @thetravellingdream3480
    @thetravellingdream3480 3 years ago +1

    I am getting this error : cannot import name 'HTMLsession' from 'requests_html'

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago

      I think it's a capital S for Session

    • @thetravellingdream3480
      @thetravellingdream3480 3 years ago

      @@JohnWatsonRooney I fixed it by installing requests_html via a pip install. Thanks for the reply anyway - great tutorial :)

  • @muhammedasrar8773
    @muhammedasrar8773 3 years ago

    Hi John, nice one. Can we get the EAN/UPC from the Amazon website?

  • @muhammadjamshed2128
    @muhammadjamshed2128 1 year ago

    How can I fetch any Amazon product's BSR in Google Sheets on a daily basis? Please make a video on tracking products' BSR, price and review numbers daily. Thanks for all the tutorials.

  • @wikd13
    @wikd13 3 years ago +1

    Really helpful video.

  • @almirf3729
    @almirf3729 1 year ago +1

    Awesome video, thanks

  • @lautarob
    @lautarob 2 years ago +1

    I have seen a couple more videos today. All excellent. Thank you so much. I would like to see a video scraping a site protected by login credentials. Would that be possible?
    Also: would it be possible to scrape the content of a SharePoint site, having admin credentials to access it?

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago +1

      Interesting prospect - not one that I have really done, as most things like that have an API that is best used. I know SharePoint does, but I understand that in most companies getting access to the API would be quite difficult

    • @lautarob
      @lautarob 2 years ago

      @@JohnWatsonRooney Thanks John. The main reason to suggest these topics is precisely because they are complex, and your clear and thoughtful explanations would be of great help.

  • @hogrider423
    @hogrider423 3 years ago

    I did exactly the same thing but I have an error message:
    line 15, in getnextpage
    if not page.find('li', {'class': 'a-disabled a-last'}):
    AttributeError: 'NoneType' object has no attribute 'find'
    What should I do??

  • @LiverpoolDon1981
    @LiverpoolDon1981 3 years ago +1

    dude you're awesome 😎

  • @avinashk8231
    @avinashk8231 2 years ago

    Make a video on scraping book prices - used, hardcover, paperback etc.

  • @ollie5845
    @ollie5845 2 years ago

    Does anybody know what theme this is?

  • @rahulwadwani9345
    @rahulwadwani9345 1 year ago

    Sir, how do you handle 403 errors? Can you make a detailed video on it? If there already is one, can you tag it here?

  • @renemiche735
    @renemiche735 3 years ago +1

    Hi John,
    What is the difference between requests and requests-html? I have no answer to this question. As an example, why use requests-html and not requests in this case? (one vs the other)
    Thanks for your work, from France :)
    (I got a 503 error; I'm going to your new tutorial.)

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +4

      Hi Rene. They have similar names, however they are 2 separate Python libraries (made by the same person). requests is for working with HTTP protocols, and that's it; requests-html also has its own HTML parser, and the ability to render a page's JavaScript, allowing us to scrape more sites

    • @renemiche735
      @renemiche735 3 years ago +1

      @@JohnWatsonRooney thank you 🙏

  • @MarcelStrobel
    @MarcelStrobel 3 years ago +2

    Hey John, fantastic content as usual! I get the following error - could you please explain why?
    page = soup.find('ul', {'class': 'a-pagination'})
    TypeError: slice indices must be integers or None or have an __index__ method

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago

      Hi Marcel, it seems that it is not finding that HTML element on the page. I'd suggest checking that the page was rendered properly, and have a look to see why the pagination list isn't appearing

    • @MarcelStrobel
      @MarcelStrobel 3 years ago +1

      @@JohnWatsonRooney Hey John, in fact, with requests_html the page wasn't rendered properly. So now I am using Splash and the site is rendered properly. I also checked manually that the specified class is there. Could it be a problem with the Python version? I am using 3.6.2.

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago

      @@MarcelStrobel did you have the sleep=1 in there? Odd, I've not had an issue with that before

    • @MarcelStrobel
      @MarcelStrobel 3 years ago

      @@JohnWatsonRooney I'm gonna try it and get back to you. Thank you very much for your help!

    • @MarcelStrobel
      @MarcelStrobel 3 years ago

      @@JohnWatsonRooney I put your tweak in and now I am getting a valid response from requests_html. Still getting the same error as above. I sent you an invite for the code on git - if you have the time to look at it, that would be highly appreciated

  • @aslammasood9504
    @aslammasood9504 2 years ago

    Its visibility is not clear.

  • @fabianrestrepo82
    @fabianrestrepo82 3 years ago +1

    Fantastic!

  • @mattmovesmountains1443
    @mattmovesmountains1443 3 years ago

    Of all the scraping tools you use, which would you recommend for building a bot to buy a PS5? Seems like a possible viral tutorial right now

    • @mattmovesmountains1443
      @mattmovesmountains1443 3 years ago +1

      After writing this, I decided to try Helium and was able to get a few basic refresh and auto-purchase bots running for the stores that don't use bot detection.

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +1

      That would have been my suggestion. Perhaps use basic scraping techniques to scrape multiple stores' pages every hour or so to see if it's in stock. If it is, then run Helium to add to cart and email you to complete the purchase

  • @KhalilYasser
    @KhalilYasser 3 years ago

    Thank you very much. I encountered this error ` if not pages.find('li', {'class': 'a-disabled a-last'}):
    AttributeError: 'NoneType' object has no attribute 'find'`. Can you help me fixing that?

    • @henrygreen737
      @henrygreen737 3 years ago

      I am getting the same error. My getdata soup value is returning "To discuss automated access to Amazon data please contact api-services-support@amazon.com.". I tried using a user-agent and got the same result. I don't have an answer, but this is my problem.

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +1

      check the repo, I updated it - the getdata() function should look like this:
      def getdata(url):
          r = s.get(url)
          r.html.render(sleep=1)
          soup = BeautifulSoup(r.html.html, 'html.parser')
          return soup
      github.com/jhnwr/amazon-pagination
      thanks!

    • @mezianibelkacem650
      @mezianibelkacem650 3 years ago

      Yes, I got this error too

  • @harshparikh7898
    @harshparikh7898 3 years ago +1

    thanks a lot!

  • @goujoe2880
    @goujoe2880 1 year ago

    if not pages.find('li', {'class': 'a-disabled a-last'}):
                 ^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'find'

  • @mylordlucifer
    @mylordlucifer 2 years ago +1

    Thanks

  • @ismaelruizranz7799
    @ismaelruizranz7799 3 years ago +1

    Great video my friend! Do you know any way to use the requests_html library in a Jupyter notebook?

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago

      Thank you. You can't use it with Jupyter I'm afraid, as they both use the same event loop and it clashes

    • @ismaelruizranz7799
      @ismaelruizranz7799 3 years ago

      @@JohnWatsonRooney No problem John, I can still learn your method. If anyone has the same problem, the quick fix is to create the bot in a .py file instead of a .ipynb;
      then you can execute it from the command line - in my case, using Linux, just run python3 bot.py

  • @UbaidKhan-cm2gz
    @UbaidKhan-cm2gz 3 years ago

    There is no class "a-disabled a-last". I am scraping amazon.in.
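Amazon has been rolling out a newer results layout where the pagination classes are `s-pagination-*` instead of `a-pagination`/`a-last` (one of the error reports above quotes `s-pagination-item s-pagination-disabled`). A hedged helper that tries both layouts; the exact class names on any given Amazon domain should be verified in the browser's inspector:

```python
from bs4 import BeautifulSoup

def find_next_href(soup):
    # Older layout: <ul class="a-pagination"> ... <li class="a-last"><a href=...>
    li = soup.find('li', {'class': 'a-last'})
    if li and li.find('a'):
        return li.find('a')['href']
    # Newer layout: <a class="s-pagination-item s-pagination-next" href=...>
    a = soup.find('a', class_='s-pagination-next')
    if a and a.has_attr('href'):
        return a['href']
    return None  # neither layout found: last page, or a block page

old = BeautifulSoup('<ul class="a-pagination"><li class="a-last">'
                    '<a href="/s?page=2">Next</a></li></ul>', 'html.parser')
new = BeautifulSoup('<a class="s-pagination-item s-pagination-next" '
                    'href="/s?page=3">Next</a>', 'html.parser')
print(find_next_href(old))  # /s?page=2
print(find_next_href(new))  # /s?page=3
```

Falling through to None keeps the surrounding loop's exit condition working whichever markup the site serves.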

  • @barathsekar8616
    @barathsekar8616 2 years ago

    Does anyone here know how to scrape the view-more button?

  • @azle7206
    @azle7206 3 years ago

    Please scrape ASIN data across multiple pages, sir, with a script....

  • @viettuan5798
    @viettuan5798 3 years ago

    Helpful. Liked and subbed. Can you make a video about how to handle it when Amazon detects the scraping script as a bot? Thanks

  • @cornstarch27
    @cornstarch27 2 years ago

    John- This may help you with your bs4 issues: ua-cam.com/video/6K3UpktQH9w/v-deo.html.

  • @im4485
    @im4485 3 years ago

    Hi John, can you please explain r.html.html? why twice?

  • @blogsbarrel4734
    @blogsbarrel4734 3 years ago

    if not page.find('li', {'class' : 'a-disabled a-last'}):

  • @Cubear99
    @Cubear99 3 years ago

    I took all the steps and it does not show page 2; it keeps looping page 1 until I break. I copied your code and this is what happened:
    2  # this will return the next page URL
    3  pages = soup.find('ul', {'class': 'a-pagination'})
    ----> 4  if not pages.find('li', {'class': 'a-disabled a-last'}):
    5      url = 'www.amazon.co.uk' + str(pages.find('li', {'class': 'a-last'}).find('a')['href'])
    6      return url
    AttributeError: 'NoneType' object has no attribute 'find'

  • @Adaeze_ifeanyi
    @Adaeze_ifeanyi 1 year ago

    def transform(soup):
        articles = soup.find_all('article', {'itemprop': 'review'})
        for feedback in articles:
            title = feedback.find('h2').text.replace('\n', '')
            ratings = float(feedback.find('div', {'itemprop': 'reviewRating'}).text.replace('/10', '').strip())
            body = feedback.find('div', {'class': 'text_content'}).text.replace('✅', '')
            date = feedback.find('time').text
            reviews = {
                'title': title,
                'ratings': ratings,
                'body': body,
                'date': date
            }
            reviewlist.append(reviews)
        if not feedback.find('li', {'class': 'off'}):
            url = 'www.airlinequality.com/airline-reviews/british-airways' + feedback.find('li').find('a')['href']
            return url
        else:
            return