Amazon Web Scraping Using Python | Data Analyst Portfolio Project
- Published Sep 7, 2024
- Take my Full Python Course Here: bit.ly/48O581R
Web Scraping isn't just for those fancy "programmers" and "software developers". Us analysts can use it too! In this project I walk through how to scrape data from Amazon using BeautifulSoup and Requests.
LINKS:
Code in GitHub: github.com/Ale...
Anaconda: www.anaconda.c...
Find Your User-Agent: httpbin.org/get
____________________________________________
SUBSCRIBE!
Do you want to become a Data Analyst? That's what this channel is all about! My goal is to help you learn everything you need in order to start your career or even switch your career into Data Analytics. Be sure to subscribe to not miss out on any content!
____________________________________________
RESOURCES:
Coursera Courses:
Google Data Analyst Certification: coursera.pxf.i...
Data Analysis with Python - coursera.pxf.i...
IBM Data Analysis Specialization - coursera.pxf.i...
Tableau Data Visualization - coursera.pxf.i...
Udemy Courses:
Python for Data Analysis and Visualization- bit.ly/3hhX4LX
Statistics for Data Science - bit.ly/37jqDbq
SQL for Data Analysts (SSMS) - bit.ly/3fkqEij
Tableau A-Z - bit.ly/385lYvN
Please note I may earn a small commission for any purchase through these links - Thanks for supporting the channel!
____________________________________________
SUPPORT MY CHANNEL - PATREON/MERCH
Patreon Page - / alextheanalyst
Alex The Analyst Shop - teespring.com/...
____________________________________________
Websites:
GitHub: github.com/Ale...
____________________________________________
All opinions or statements in this video are my own and do not reflect the opinion of the company I work for or have ever worked for
The real talk is nice. “It took ten hours over two weeks”. These are things people need to hear. Some people watch these videos on YT and think it is just that easy. This is why your channel is on my short list of channels I subscribed to. Thanks for all your time on these.
Hey MS Excel - sponsor this channel!
I try to make it as realistic as possible - I used to think people could do this all off the top of their heads and I would get discouraged. Glad to hear that! :D
@@AlexTheAnalyst For the same product, I couldn't find the id for the price... it shows a div class... what to do?
This should work if you tweak it well enough
@@pkabir4625 go a little bit up and you will find the id, but you have to use the strip function and a slice like [1:4] (or adjust the values as per your requirement) to get the exact values. This worked for me
it did not work for me, it's not showing the price id, it's in a span tag
Alex is so honest and down to earth, he doesn't have that usual YouTuber vibe that we are accustomed to. Man, we're so lucky to find you as a mentor.
That means a lot! Thanks for watching! :D
@@AlexTheAnalyst hi. How do you find the code you showed on the right of the t-shirt web page? You selected the price... then the code for the price got selected. How do you do that?
@@pulakkabir2276 right click and click Inspect, or use Ctrl+Shift+I
The section where you speak about how you shouldn't know this by heart is so good. Honestly... I am learning SQL as per your recommendation, but in the back of my head I am scared, as I think I should learn and memorize every single block of code... And this is awful... Thank you for being honest and clear on that!
How is it going?
man, I've been battling with the bot blocker from Amazon and also some scraping issues with the price, because the website display was changed a while after this video was uploaded. But I've managed to pull it off, so I hope this helps recent viewers who might be as confused as I was when I started writing this code on my own.
Apparently you need to split the second cell, so you run soup1 first before you run soup2. Then for the price you need to pull three parts, span class="a-price-symbol", span class="a-price-whole", and span class="a-price-fraction", and combine them into one new variable (price). Then you need to clean it using strip() and replace() to remove the whitespace and \n's.
hope this helps!
Brother, please elaborate, I am stuck
Hey bro, can you explain it or share your code? How did you pull the three parts together? I am stuck on this part
Could you please explain it? I am stuck on getting the Title itself
@@sdivi6881 Hey, I just solved it. Can you tell me a little more about where you are getting stuck?
@@deeplakshmiyadav
price_symbol = soup2.find(class_='a-price-symbol').get_text(strip=True)
price_whole = soup2.find(class_='a-price-whole').get_text(strip=True)
price_fraction = soup2.find(class_='a-price-fraction').get_text(strip=True)
price = f'{price_symbol}{price_whole}{price_fraction}'
print(price)
It's been a year since I did this project, and despite searching and watching other channels, I always come back to your channel. You are simply the best person I have learned from; you are genuine and always able to get your point across. I hope you expand your "python for data analysis" series just like you did with SQL.
Thank you so so much .
As others described, if you get an error when running the second cell, it's probably due to a captcha issue where Amazon thinks you are a bot. You can force it by pressing Ctrl + Enter again and again until you get an output. I'm sure there is a better way to get around this, but that's the quickest semi-solution I found.
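The brute-force approach in that comment (re-running the cell until the CAPTCHA page goes away) can be automated with a small retry loop. This is only a sketch; `fetch` and `is_blocked` are placeholder callables you would supply yourself, e.g. a wrapper around `requests.get` and a check for a CAPTCHA marker in the returned HTML:

```python
import time

def fetch_with_retry(fetch, is_blocked, max_tries=10, delay=2):
    """Re-run `fetch` until the response no longer looks like a CAPTCHA page.

    `fetch()` returns the page HTML; `is_blocked(html)` returns True when
    the site served a bot-check page instead of the product page.
    """
    for attempt in range(max_tries):
        html = fetch()
        if not is_blocked(html):
            return html
        time.sleep(delay)  # back off a little before retrying
    raise RuntimeError("still blocked after %d tries" % max_tries)
```

This just formalizes the "press Ctrl + Enter again and again" trick; it will not get past a blocker that rejects every request.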
The while loop definitely doesn't work now that Amazon updated their website with some sort of blocker for bots. It might work a few times but eventually stops running in the background.
Hey Alex, quick tip: when you were fixing the indentation around 34:21, if you select everything you want to move and press Tab, everything you selected shifts one tab to the right. Little things like that improve your quality of life sometimes. Thanks for the tutorial :)
You don't imagine how this tutorial has helped me in my new position. Thank you so much!!
So glad it was helpful! :D
@@AlexTheAnalyst Did you ever make the second one? So many people want to see it! Please do send it out!
Unfortunately it no longer works (due to an Amazon website update, I believe, as others have commented) :/ Would love another scraping video so I can learn!! Love all the videos Alex and thanks so much!😊
@@nezzylearns happy to help
Were you able to bypass the Amazon scraping detection? I am also receiving a NoneType error.
@@VishalSharmaOfficialVS I unfortunately wasn't able to figure it out :/ This is one of the harder projects (to me) so I was going to circle back after going through the rest of Alex's projects. If you figure out how to bypass it plz comment here with an update!
@@krystlestevens2585 sure! I’m working on it. As soon as I have a concrete solution, I will post it here. Thanks for your reply.
@@VishalSharmaOfficialVS did you ever figure this out?
Sir, I am very near to getting my first job through your project
Thank you
And this is also my first project
Hey Alex! Thanks for this helpful video! The best part of this video is whenever you said 'I don't know what that is' (12:50), instead of some difficult theory. You don't know that, and neither do I, so it makes me feel less pressure about learning Python...
This project gave me a taste of how challenging web scraping is. Great video that makes things look easy and less intimidating.
So great Alex! I followed along with this entire project and added it to my portfolio! I'll be sure to give you credit in my README file. :)
Hi Alex, it seems like this code is no longer working. I would be grateful if you could do another web scraping project with EDA analysis.
Love how instructive your videos are.
By any chance, was there a part 2 to this with the more advanced scraping? Would love to see that :)
Looking for the part 2 you mentioned in the vid!! Thanks
Hi Alex, I really appreciate how you shared how long this project actually took you. It helps to know the difference between what we go through on your channel and the work/time it actually takes behind the scenes. AWESOME project! I learned tons and found all of it very useful/helpful. You are such an AMAZING teacher and resource! As always, THANK YOU!!
Bro, did you get the NoneType error, and how did you solve it?
I have this error bro, and don't know how to solve it
@@valadhruv6920 bro, did you find the solution for this? I can't figure it out
What attracted me in your video is that you have 3 kids, this is a great man
God bless your family
you're already doing a great job man. Thanks a ton, and hats off to you.
But,
We need that part 2. Please do it asap Alex.
29:30 - quick tip: select the file, hold shift and right-click to get “copy as path” in the context menu.
Anyone stuck trying to get the price:
price_symbol = soup2.find(class_='a-price-symbol').get_text(strip=True)
price_whole = soup2.find(class_='a-price-whole').get_text(strip=True)
price_fraction = soup2.find(class_='a-price-fraction').get_text(strip=True)
price = f'{price_symbol}{price_whole}{price_fraction}'
print(price)
Hi. I am not even able to print(soup1) due to some sort of anti-bot blocking from Amazon. Do you have any idea how to solve that? Thank you
Thank you!
Thank you man, I realized something was different with the HTML but lacked the coding skills to fix it in a timely manner.
Thumbs up, I've spent too much time looking for this comment. @Alex, can you include that in the bio?
Thank you so much for this
"AttributeError: 'NoneType' object has no attribute 'get_text'" to solve this
delete headers
still showing the same error
It worked
@@Kshitij-Rajkumar-Yadav can you elaborate, bro?
@@mulikinatisiddarthasiddu8245
Sure man. If you get the above error, go to the place where you entered the URL, then find the headers and delete them, so it'll be:
url
page = requests.get(url)
soup1
This should fix it. If you have any other issues let me know; I just finished this code, so I went through everything and I'll share it
@@mulikinatisiddarthasiddu8245 price_symbol = soup2.find(class_='a-price-symbol').get_text(strip=True)
price_whole = soup2.find(class_='a-price-whole').get_text(strip=True)
price_fraction = soup2.find(class_='a-price-fraction').get_text(strip=True)
price = f'{price_symbol}{price_whole}{price_fraction}'
try this, it will work if you are getting the error when getting the price
Thank you for this, Alex. I felt so happy when I finally could scrape the website I had been trying to scrape (I applied your teaching to another website). Really appreciate your work.
Hey Alex! It was a super helpful video. Thank you so much for posting it. Have you uploaded the next part of this video? If yes, please share the link.
Thank you, as always, for all your efforts and good work! I love watching your videos. Your positive attitude and way of expression make the lesson even more fun. I've seen a few people say the video is too long, but I think being able to walk through the lesson together, rather than watching videos that only show finished code, is much better for learning. Thank you thank you thank you ☺
If anyone else has a problem like I did with getting a captcha output when printing soup2: I solved it by putting soup2 and the print statement in a different cell, then running the first cell with soup1, then running the second cell with soup2 and the print statement separately.
This man is a godsend to ALL the broke data analyst students
No kidding
When I try to print the title I'm getting an error message: "'NoneType' object has no attribute 'get_text'". What is the issue here?
Same
Thanks for sharing! This is an awesome video. I'm not sure if you did this but I think it would be cool to learn how to scrape multiple pages then append the data in a def function.
Thank you for demonstrating! I never thought that a simple project like this could be used as a portfolio project. I just realized that I have what it takes to become a DA. Thank you for demonstrating projects!
I am so grateful for finding you. Almost feels like I know you personally. I'm still very new to this whole Data Analytics but I'm learning a lot.
A quick question: I'm on the Google Analytics course on Coursera and the language is R. Any ideas on where I can learn Python, preferably in a structured way that is beginner friendly?
Again thank you for the work. Truly amazing.
I'm so glad to hear that! I honestly would use some YouTube videos to just get the hang of it, then I would check out my Udemy course recommendations in the description below; those are ones I've taken and loved. That would be my next step. Thanks for watching! 😁
Very nice 👍 that'd be good for checking the prices on Udemy courses. 😅
I am also taking the Google Analytics course. One question I would like to ask is, how do you recognize or prevent bias in the data being collected?
One thing I'd like to point out here is that you can easily switch from R to Python. There are plenty of courses out there, like Alex has mentioned, but the key takeaway from the course which I did finish and landed a data analysis job with is: when following Alex, watch how he uses pandas and other packages, which are essentially the same as the tidyverse in R. Look at the packages and how he writes the code. I think that will help you the most on top of taking courses.
@@nickmoritz1515 Hey man! How is it going? Can you share some tips that helped you land a Data Analyst job, or maybe some additional stuff? I've almost finished Alex's data analyst bootcamp and I'm an undergraduate bachelor's student. I would be grateful if you could share, please?
16:30 Solution to get the price:
price_symbol = soup2.find(class_='a-price-symbol').get_text(strip=True)
price_whole = soup2.find(class_='a-price-whole').get_text(strip=True)
price_fraction = soup2.find(class_='a-price-fraction').get_text(strip=True)
price = f'{price_symbol}{price_whole}{price_fraction}'
print(price)
I'd like to thank you for sharing this wonderful video! Thanks to you, I've just managed to make my own web scraper that saves me so much time. Otherwise, my coworker and I would have to spend more than 6 hours per week 😂
Can you please make a video on how to present these projects? I've seen your video about the portfolio website, but I don't have an idea of how to actually present the GitHub..
And thank you very much. Your channel has been very inspirational to me throughout my learning journey!
Good idea!
@@AlexTheAnalyst Hi Alex, to further add to my comment - I've taken a look at other "best example" portfolios online but comparing it to the Google data analytics portfolio guidelines, they are very different. Hence my conflict and lack of general understanding on how to present these projects in a website.
Thank you.
If you don't pull in the data due to the captcha, don't pass the headers as the second argument.
Great video Alex, it was really helpful for a module in my course. I have been looking for the intermediate video you spoke about.
Hello Alex!
One more step is done!!! It's so exciting. I got stuck at the stage where I had to get the price data; I had missed this metric to be scraped. Since the time you recorded this video, some parts of the HTML have been updated, so the price no longer exists in the format of "id="; it lives now as "div class=". So now it is challenging to find out how to scrape the price though :))) I will go deeper into the topic. Thanks so much for your time and for sharing your knowledge.
Dude! I'm an amazon seller and this kind of work would come in super handy. Thank you. Did you ever get around to making the next video where you pull data from all the search results page? I'd be really interested to see that one.
Hi Alex, I have learned a lot from the 65 videos of the Bootcamp. God bless you with everything. Thanks!!!
Though the project was quite tricky, I got through it.
Thank you so much Alex.
The long awaited one ❤️💯
Hey Alex, first I want to thank you for this amazing series and everything you do to help the community.
Second, I am working on this project and it seems Amazon implemented a CAPTCHA to prevent scraping. Is there any way around this? Would love to know if this project is applicable and doable even 2 years later. Cheers!
Yeah. I have been trying for 2 hours to get into Amazon. I think it is a bit more difficult now. Were you able to find a way?
Having the same problem.
Yeah, same here. There are ways to bypass it, but it looks like it might be borderline unethical. Zenscrape, Apify, or ScraperAPI give you the ability to fetch the data directly from an API instead of the HTML page (Beautiful Soup).
If you are running into an issue with the header, try this:
headers = {"User-Agent": ".......", "Accept-Encoding": ".....", "Accept": "......"}
Just put in whatever you get from the User-Agent link in the video description.
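For reference, a filled-in version of that headers dict might look like the sketch below. Every value here is a placeholder; copy the real strings that httpbin.org/get reports for your own browser:

```python
# Placeholder values: replace each string with what httpbin.org/get
# shows for YOUR browser (especially the User-Agent line).
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
}

# The dict is then passed as the `headers` keyword argument:
# page = requests.get(url, headers=headers)
```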
Alex, please make a video on how to create your own data set for a data analyst job? Please do.
It was a fun project. Please drop the other version (the complete version) of this project @Alex The Analyst
Hello Alex, thanks for sharing. I have found this error in my code for this section:
title = soup2.find(id='productTitle').get_text()
print(title)
output:
'NoneType' object has no attribute 'get_text'
Please, I need your advice
same here
Hello!
You can do this:
page = requests.get(url,headers=headers)
soup= BeautifulSoup(page.content, "html.parser")
and then get your data:
title = soup.find(id='productTitle').get_text().strip()
price = soup.find('span',class_='a-offscreen').get_text().replace("$","")
You don't need prettify anymore, as your computer can easily read that
@@cocojamborambo5435 Weirdly enough, this works for one moment, and then it stops when I run it again.
Same.
You edit the headers and add 'Referer'
Wow!!! This is awesome!!! You have such an easy way of teaching. I already have a base in Python, but I've never done this before, and you make it so smooth and easy to do!!!! Thank you thank you ❤
Thanks Alex! Really a great video... I'd request you to kindly do a similar one on capturing real-time stock prices as a time series, and configure an email notification when the current price drops below, say, the 50-day moving average...
I am looking forward to many videos like this... thank you!
This script is only giving me short HTML, and I've got a "NoneType" value at the end.
Thanks Alex I am working on my own web scraping project for checking placements of searches and this video definitely helped
This seems like an amazing project. Sadly, something changed in Amazon's policy on scraping their data and I couldn't access it. If someone finds a way to make it work, I would love to hear it 😁 I'll keep going with the other projects!!
So whoever is not able to find the id for the price, and is getting a 'span' tag and a class when clicking on the price (shown on the product page) in Inspect, can follow this code:
price=soup2.find("span", attrs={'class':'a-price-whole'}).text.strip()
print(price)
replace 'a-price-whole' with whatever you are getting for the class
Were you able to access the site at all?
I did get a 503 error right from the beginning while trying to request the URL. However, I decided to use my Selenium web drivers and it worked for me... If you don't know how to use that, then I suggest you scrape another website. Amazon has gotten tighter.
You can use Selenium to bypass all of their human checks, but it's a bit more of an advanced subject.
I'm getting this error when trying to print the title using soup2; I tried to resolve it but couldn't. Let me know if anyone has the solution for this:
AttributeError: 'NoneType' object has no attribute 'get_text'
getting the same error
Have you been able to solve it?
Been having the same issue; for a second the code worked, and then all of a sudden it stopped working
@alex could you please explain why this happens when we run the code and how to resolve it.
(I think it's due to Amazon's security protocols, which detect either a bot or a programming language that's trying to fetch the data.)
@@farazbhatti6120 Amazon has basically caught onto this method of web scraping their site. A newer method involves rotating your user agents constantly - essentially to look like you're accessing Amazon from different devices. However, you also need to pair this with a proxy, otherwise Amazon would see you're trying to access Amazon from different devices, all from the same IP, hundreds of times a day. It's a lot more complicated now and the video is no longer working unfortunately.
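The user-agent rotation described in that reply can be sketched in a few lines. The strings in the pool below are illustrative stand-ins; a real scraper would maintain a large, up-to-date pool and pair it with rotating proxies, as the comment notes:

```python
import random

# Illustrative pool of user-agent strings; real scrapers use much larger,
# current pools, paired with rotating proxies so requests don't all come
# from one IP.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0",
]

def random_headers():
    """Build a fresh headers dict with a randomly chosen user agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

# Each request then appears to come from a different browser:
# page = requests.get(url, headers=random_headers(), proxies=proxies)
```

On its own this will not defeat Amazon's detection; it only illustrates the rotation idea the comment describes.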
Same issue here for me! I want to continue with the project but I cannot, due to this same get_text error message! @alex please help
similar error
Your tutorials are so good. and i follow you on LinkedIn, your content is awesome. i love how you explain things in a clear way. keep up the great job!!
I am absolutely fascinated by your thorough explanation
After a looooong delay caused by many things, I can finally finish this portfolio.
One of the only channels with least haters ✨
I wish I had more so I could be cool
Thrilled to successfully get to the end of this @Alex - appreciate these real-world worked examples.
Hello Alex,
Thank you so much for all you do. I am using this video now, when Amazon doesn't include the word 'price' anywhere while inspecting. How do I go about that? I hope you reply, because I am sure a lot of new learners are having this issue.
Super early, love your stuff as always Alex!
You are very early! Lol Thanks for watching 😁
Did you upload the second part?
I loved this one.
Please share the second one
"'NoneType' object has no attribute 'get_text'". I am facing this error.
Thank you so much, Alex! Your teaching style has made learning incredibly enjoyable and accessible. I've learned a lot in just one month and completed my portfolio projects, even though I skipped Excel and Power BI for now. Your anecdotes about your dog, family, and personal experiences have added a fun touch to the learning process. Your impact on learners like me is undeniable, and I'm looking forward to purchasing a course from your website soon. Keep up the fantastic work! 🥂🥂
Wow, this is EXACTLY what I have been looking for. Alex the GOAT in DA. :) You are 1000x Awesome!
Help, I'm really having issues with "AttributeError: 'NoneType' object has no attribute 'get_text'". I have tried everything I could think of; how can I resolve this?
Experiencing the same. I used an if check to determine whether the title exists; apparently it doesn't.
Same here !!!
I'm stuck at minute 16:35,
the code doesn't work :(
Could someone help us please?
Thank you!!
did you find a fix? I am having the same problem.
@@yashwanthgunturi8762
Also, for 'title', you can type the text below after soup's definition:
title = soup.find(id='productTitle').get_text()
print(title)
@@yashwanthgunturi8762 yeah, I found the error; use this code:
title = soup2.find(class_='a-size-large product-title-word-break').get_text(strip=True)
"Guys if you can't tell, I'm in need of some help here" 😂😂😂
The struggle is real
BTW, Amazon has changed some of the website code. New people will need to adjust accordingly.
Thanks Alex! this was really useful. I am waiting for the second part with the pagination 😅😅
I did this project recently; if anyone faces the error, please edit the code and write this instead:
title = soup2.find('span', attrs={'id':'productTitle'}).get_text()
price = soup2.find('span', attrs={'class':'a-price-whole'}).get_text()
it worked !! thankyouu
Hi Alex, still waiting for the other video. thank you
I really like these long videos where you explain things like this instead of short videos. Thanks for uploading, Alex!
Glad to hear it! I try to change it up every so often :)
Hey Alex, thanks for the walkthrough. When is the next web scraping project coming? I'm so hyped.
Looking forward to that too
hello Alex?
Thank you for this amazing tutorial, it helped me so much.
Please, did you do the video for the whole website,
like how to scrape all the pages?
Thank you again
Very nice tutorial! Amazon seems to have changed the code in the id = priceblock_ourprice part; could you update the code accordingly?
Thanks a lot for enlightening us on web scraping. I only came to know after watching this video that such stuff can be done.
Hi, why do I get this error: "AttributeError: 'NoneType' object has no attribute 'get_text'"? I don't understand it; can anybody help me solve this?
TIA
It cannot find the id named priceblock_ourprice anymore, therefore it returns None
Loved the video. But I really burst out laughing when you said: "I don't want my head to be here for the entire time. I'm gonna get rid of myself!" I thought to myself: not a good head space to be in. 😅 You are naturally funny.
Thanks for the knowledge and laughter.
His simplicity and humor get me every time, and they help with the flow of his lessons. So amazing
I thought you were now only gonna make videos on management and stuff. Glad you are still making tutorials
Nah, content really won't change much - I'll be doing Tableau tutorials very soon
This is what I have been waiting for!
Thank you
Here is a potential fix for the common error: change "html.parser" to "lxml"
Thanks a lot; with html.parser, Amazon was restricting the scrape
It works as of now. Thanks
I LOVE YOU
Wonderful! I'll practice with this tonight!
Hello, I'm getting this error:
AttributeError: 'NoneType' object has no attribute 'get_text'
Can anyone guide me through it?
Thanks a lot in advance.
Same error here
I tried with a different product and it works just fine. My guess is multiple people have pinged the same URL so amazon went ahead and blocked it. Try a different product.
Same here. Just be sure that the price is available in your zone. In my case the item was not displaying the price, so it was not available to the code. I changed the product, and then it was fine :)
@@azhiylo6403 thankyou brother, my issue is resolved now.
@@shivamsharma379 great 💪🏻you are welcome
The code gives me an error with the price (the product title works, though). I get "AttributeError: 'NoneType' object has no attribute 'get_text'"
Oh man, I was thinking about a project related to Amazon data scraping, and here YouTube suggested it to me B-)
Hope it helps!
@@AlexTheAnalyst Yes it was, Thank You :-)
When I try to run the first cell, I get an error message.
I think Amazon updated their site to make it harder to scrape data. Do you agree?
(Edit)
Fix: soup2.find('span', {'class': 'a-offscreen'}).text.strip()
Thanks for the fix. It worked.
Thanks!
When will the next part of the web scraping series be out? Thank you for posting this video!!!!!
Hi Alex, this was really great! Thanks for doing this video. Did you ever do the follow-up video that was mentioned at the end of this video?
I'm interested in that follow-up as well!!
Love this. I'm curious about the headers part; I didn't know about that before.
You have 3 kids! And you are 28. OMG, you are like my old wise sensei, Master Yoda.
Can someone direct me to the other project that you discussed doing in this video? Where you build a crawler that goes through each page's content and gets the prices.
Really enjoyed this video! Any update on when the one for multiple pages would be ready? I didn't see it on your channel
man it was super easy to understand, you nailed it
So glad to hear it!
You are great, this is exactly what I am looking for...
Thank you! Amazing. Waiting for the next video 😉
At 14:20... title = soup2.find(id="productTitle").get_text() is giving me this error: AttributeError: 'NoneType' object has no attribute 'get_text'. Can you or anyone else give me an idea of why this is happening? Is it possible that Amazon no longer allows scraping?
Amazon has basically caught onto this method of web scraping their site. A newer method involves rotating your user agents constantly - essentially to look like you're accessing Amazon from different devices. However, you also need to pair this with a proxy, otherwise Amazon would see you're trying to access Amazon from different devices, all from the same IP, hundreds of times a day. It's a lot more complicated now and the video is no longer working unfortunately.
Really cool project with an email feature in the end! Thanks, Alex.
Mannnn, pleaseeee keep going, we need your help. Your tuts are on a whole different level; I am able to learn and understand with ease. Thanks a lotttttt, and once again, keep going
Great video! I am stuck on the part where you print the price. I cannot find 'priceblock_ourprice' anywhere. It seems like they changed the way they display their price somehow.
Same here
Looking at the HTML of another product, I found id="corePrice_feature_div" and id="corePrice_desktop". I tried the first one and it worked
@@diogenes1683 thanks for the info, nearly broke my brain looking for a price that made sense.
but mine worked with id='apex_desktop'
If you have problems with the price not having an id, use its class instead; your code should look like this:
price = soup2.find('span', {'class':'a-offscreen'}).get_text()
this should give you the price.
from the first position, that's the word you were looking for.
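The class-based lookups suggested in these comments can be sanity-checked against a static snippet of HTML, without hitting Amazon at all. The markup below is a simplified, hypothetical stand-in for the real product page:

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup mimicking the relevant parts of an Amazon
# product page: the title span and the hidden a-offscreen price span.
html = """
<span id="productTitle"> Funny Data Analyst T-Shirt </span>
<span class="a-price"><span class="a-offscreen">$16.99</span></span>
"""

soup = BeautifulSoup(html, "html.parser")

# Title still lives under id="productTitle"; strip the padding whitespace.
title = soup.find(id="productTitle").get_text().strip()

# The price has no id anymore, so look it up by class instead.
price = soup.find("span", {"class": "a-offscreen"}).get_text().strip("$")

print(title)  # Funny Data Analyst T-Shirt
print(price)  # 16.99
```

If these lookups work on the static snippet but fail against the live page, the problem is Amazon's bot blocking (you got a CAPTCHA page), not your parsing code.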
Sir please make more portfolio projects for fun
Thanks man. You are helping a lot of people like me. Keep doing these portfolio videos!
Thanks Alex. I’m a big fan.