14:45 source venv/bin/activate is for the Mac. If you're on Windows:
use ".\venv\Scripts\activate" in your terminal
on windows you just type the name of the venv folder, then \Scripts\activate, as long as you are in the project folder.
Example:
PS D:\Projects\Scrapy> .venv\Scripts\activate
wow you are my hero
in case of security issues you might need this too :
Set-ExecutionPolicy Unrestricted -Scope Process
I'm in part 8 and I can't thank you enough for this course! The level of given knowledge is UNREAL !!!
The issue we faced in part 6 was that the values added to the attributes of our `BookItem` instance in the `parse_book_page` method were being passed as `tuples` instead of `strings`. Removing commas at the end of the values should resolve this issue. Once we fix this problem, everything should work perfectly without needing to modify the `process_item` method.
Thanks a lot.
goat
FYI for those who want to scrape dynamic websites: dynamic websites need Selenium, which is not included in this course.
But no cap, this is a great course.
No 🧢
Is it hard to add Selenium into the web scraping project from this video? Not too sure if that is a dumb question or not, still learning.
@@jamo6857 same question, did u get the answer?
So have u found out how to scrape dynamic websites?
@@jamo6857 not sure...but for me...I could not use the selenium driver in my pc.
at 52:00 you don't need to check for catalogue, you can just follow the url in the tag and it gives me 1000 items
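For anyone who wants to see what that looks like, here's a minimal sketch (not the course's exact code) using response.follow, which resolves the relative href against the current page for you; the spider and method names just mirror the ones in the video:

```python
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            # response.follow takes the relative href straight from the tag
            # and joins it against the current page URL, no catalogue check needed
            yield response.follow(
                book.css("h3 a::attr(href)").get(), callback=self.parse_book_page
            )

        # pagination works the same way
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book_page(self, response):
        # detailed field extraction goes here (parts 5/6 of the course)
        pass
```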
Note for Windows users:
To activate virtual env, type venv\Scripts\activate
very useful for windows user :)
Didn't work for me. Can't seem to get it to activate
@@entrprnrtim in the terminal, switch from PowerShell to cmd
The actual one is .\virtualenv\Scripts\Activate
venv/Scripts/Activate.ps1
I am a python newbie without any experience in coding. With the help of this guide I am able to write a spider and fully understand the architecture. Really helpful👍👍👍 They also have other guides to help you polish and improve your spider, highly recommended!
This is the first coding course I followed up to an end. Nicely taught. Keep it up.
Is it good?
@@riticklath6413 ya
13:37 creating venv
17:45 create scrapy project
29:31 create spider
33:38 shell
1:34:58 instead of using a lot of if statements, use a mapping.
for example:
# saving the rating of the book as an integer
ratings = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
rating = adapter.get("rating")
if rating:
    adapter["rating"] = ratings[rating]
This is not only faster but it also looks cleaner.
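For anyone wondering where that goes, a rough sketch of the mapping sitting inside the pipeline's process_item (assuming the ItemAdapter setup from the video; class and field names may differ from yours):

```python
from itemadapter import ItemAdapter


class BookscraperPipeline:
    RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # turn the word rating ("Three") into an integer (3) with one dict lookup
        rating = adapter.get("rating")
        if rating:
            adapter["rating"] = self.RATINGS[rating]

        return item
```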
Amazing tutorial, I've only gone through half of it, and I can say it's really easy to follow along and it does work ! Thanks a lot !
Thank you for the time you've put into this tutorial. That being said, you should make clear that the setup is different on Windows than on Mac. No bin folder, for example.
🎯 Key Takeaways for quick navigation:
00:00 *Scrapy Beginners Course*
01:51 *Scrapy: Open Source Framework*
03:12 *Scrapy vs. Python Requests*
04:24 *Scrapy Benefits & Features*
05:21 *Course Delivery & Resources*
06:18 *Course Outline Overview*
08:20 *Setting Up Python Environment*
16:38 *Creating Scrapy Project*
20:05 *Overview of Scrapy Files*
26:07 *Understanding Settings & Middleware*
27:13 *Settings and pipelines *
28:22 *Creating Scrapy spider *
30:24 *Understanding basic spider structure *
33:32 *Installing IPython for Scrapy shell *
34:27 *Using Scrapy shell for testing *
36:35 *Extracting data using CSS selectors *
38:23 *Extracting book title *
39:43 *Extracting book price *
40:49 *Extracting book URL *
41:18 *Practice using CSS selectors *
42:02 *Looping through book list *
43:15 *Running Scrapy spider *
47:29 *Handling pagination *
53:52 *Debugging and troubleshooting *
56:12 *Moving to detailed data extraction*
Update Next Page
Define Callback Function
Start Fleshing Out
Data cleaning process: Remove currency signs, convert prices, format strings, validate data.
Standardization of data: Remove encoding, format category names, trim whitespace.
Pipeline processing: Strip whitespace, convert uppercase to lowercase, clean price data, handle availability.
Converting data types: Convert reviews and star ratings to integers.
Importance of data refinement: Iterative process of refining data and pipeline adjustments.
Saving data to different formats: CSV, JSON, and database (MySQL).
Different methods of saving data: Command line, feed settings, and custom settings.
Setting up MySQL database: Installation, creating a database, installing MySQL connector.
Setting up pipeline for MySQL: Initialize connection and cursor, create table if not exists.
01:56:31 *Create MySQL table*
02:04:42 *Understand user agents*
02:13:03 *Implement user agents*
02:25:01 *Scrapy API request*
02:26:11 *Fake user agents*
02:27:20 *Middleware setup*
02:33:00 *Robots.txt considerations*
02:40:19 *Proxies introduction*
02:42:34 *Proxy lists overview*
02:52:17 *Proxy ports alternative*
02:52:32 *Proxy provider benefits*
02:53:12 *Smartproxy overview*
02:54:44 *Residential vs. Datacenter proxies*
02:55:27 *Smartproxy signup process*
02:56:19 *Configuring Smartproxy settings*
02:58:07 *Adjusting spider settings*
03:00:23 *Creating a custom middleware*
03:01:21 *Setting up middleware parameters*
03:03:02 *Fixing domain allowance*
03:04:17 *Successful proxy usage confirmation*
03:05:00 *Introduction to proxy API endpoints*
03:06:29 *Obtaining API key for proxy API*
03:07:54 *Implementing proxy API usage*
03:10:36 *Ensuring proper function of proxy middleware*
03:12:10 *Simplifying proxy integration with SDK*
03:13:25 *Configuring SDK settings*
03:14:47 *Testing SDK integration*
03:17:56 *Upcoming sections on deployment and scheduling*
03:21:22 *Scrapyd: Free, configuration required.*
03:21:35 *ScrapeOps: UI interface, monitoring, scheduling.*
03:22:02 *Scrapy Cloud: Paid, easy setup, no server needed.*
03:49:42 *Dashboard configuration guide.*
03:51:21 *Set up ScrapeOps account.*
03:52:48 *Install monitoring extension.*
03:55:24 *Server setup instructions.*
04:00:51 *Job status and stats.*
04:01:47 *Analyzing stats for optimization.*
04:02:42 *Integration with ScrapeOps.*
04:18:05 *Scheduler Tab Options*
04:19:14 *Job Comparisons Dashboard*
04:20:15 *Scrapy Cloud Introduction*
04:21:36 *Scrapy Cloud Features*
04:22:20 *Scrapy Cloud Setup*
04:25:33 *Cloud Job Management*
04:28:57 *Scrapy Cloud Summary*
Made with HARPA AI
Thanks!😀
For PART 8 if anybody is having trouble with the new headers not being printed to the terminal, make sure in your settings.py file that you enable the "ScrapeOpsFakeUserAgentMiddleware" in the DOWNLOADER_MIDDLEWARES and not the SPIDER_MIDDLEWARES.
He explained that in the video.
@@jonwinder6622 Yeah after going through it again I realized I missed that detail..
@@SpiritualItachi I dont blame you, its so easy to look over since he literally goes through so much lol
Thanks for another great video FreeCodeCamp! This is something I've wanted to spend more time on for a long time with python!!
for windows users: If you get error first type Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy Unrestricted -Force
and after that type venv\Scripts\activate
This worked for me, many thanks.
it works for me too, thanks pal
thanks mate, I ran into this error, went to PowerShell to check the execution policy and it said restricted; I thought I was stuck.
This is so cool! I was able to follow until Part 6 but from Part 7 I couldn't so I will come back in the future after I have basic knowledge of MYSQL and databases. (Note for myself).
I wasted 30 bucks on udemy courses and they are not nearly as good as this tutorial, thanks man
When selecting a random user agent from your list, you can do random.choice(self.user_agents_list).
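In context that looks roughly like this (a sketch of a downloader middleware with a hard-coded list standing in for whatever user_agents_list you built from the ScrapeOps API):

```python
import random


class ScrapeOpsFakeUserAgentMiddleware:
    def __init__(self):
        # stand-in list; in the course this gets populated from the ScrapeOps API
        self.user_agents_list = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
        ]

    def process_request(self, request, spider):
        # pick one user agent at random for every outgoing request
        request.headers["User-Agent"] = random.choice(self.user_agents_list)
        return None  # returning None lets the request continue through the chain
```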
Thanks Joe Kearney! Nice course, of course. You are a good teacher, love it
I just finished part 7 and want to say thanks for the great tutorial!!
this tutorial really needed the code aspect to help make sense of what is going on and fix errors. thanks
I'm starting this course now and very excited! Thanks for the effort of teaching it
Thanks for this crazy course !!!
This is gold for beginners like me. Tks.
A wonderful video that we've used as a reference for our recent additions. Your sharing is highly appreciated!
Part 4, 54:07
if you're wondering why 'item_scraped_count' is still only 40, the href is probably already a full url, so don't prepend the domain again
teach yourself to improvise💪
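If you want something that works whether the href is relative or already a full url, urljoin (or response.urljoin inside the spider) leaves absolute URLs alone and resolves relative ones, so the domain never gets doubled up. A small standalone sketch:

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/catalogue/page-2.html"

# relative href -> resolved against the current page
print(urljoin(base, "a-light-in-the-attic_1000/index.html"))
# https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

# href that is already a full url -> returned unchanged, no doubled domain
print(urljoin(base, "https://books.toscrape.com/catalogue/some-book/index.html"))
# https://books.toscrape.com/catalogue/some-book/index.html
```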
if you have problems on 1:17:32 running >>> scrapy crawl bookspider -o bookdata.csv
instead write >>> scrapy crawl bookspider -o file.csv -t csv
make a course to block the crypto spammers
btw thanks for the scrapy course, i was searching for this for a while😃
such a complete course...
the course I needed months ago 😭
did you try some other course?
@@_Clipper_ bought two Udemy courses. The tutorials on YouTube are limited. So is this one.
@@yooujiin are you in data science? I need some recommendations for ML and web scraping. I tried Jose Portilla's course and it wasn't very in depth so I refunded it. Please recommend only if you are in the same field or the course was suggested to you by someone you know in ds/ai/ml.
@@_Clipper_ I'm currently doing my masters in software development. I would love some recommendations myself. I recommend the scrapy course by Ahmed Rafik
44:33 my spider doesn’t give back the data from the html, it crawls but stops without having selected any data. I rewrote the code multiple times but it doesn’t change.
*just solved it: had to save the bookspider code first
Thank you so much for providing this content for free. It's truly incredible that anyone with an internet connection can get free coding education, and its all thanks to people like you!
1:24:48 don't forget to remove the commas after book_item['url'] = response.url and all the others when we add the BookItem import, because otherwise some values end up as tuples instead of strings
Please help me, I got 2 errors from this line : from bookscraper.items import BookItem. (errors detected in items and BookItem). Has anyone faced the same issue as me?
They should add a certification option for those who complete all the courses. It'd be so interesting
Very practical and helpful video with very detailed explanation!
Very clear explanation. Many thanks
Nice video! Unfortunately part 6 has a lot of code without debugging, so it's really hard to fix errors. Something is going wrong with my code, but I can't identify it
thx! very hard to follow, needed a solid knowledge in python
In case anyone is having a problem activating the venv on Windows, use the following command: .\venv\Scripts\activate
02:47:00 - the fun part is that you could... scrape geonode for IPs 🙂
oh man, he was just showing me how good his code is !!!!!
Great content! It helped me a lot to understand some concepts better. 💯
CAN SOMEONE HELP ME!!!!!???? At part 3 when you create bookscraper, I don't have bookspider.py created for me. What do i do for it to be generated???? I AM CONFUSED
1:49:03 I am getting a blank json file with only [ ] inside the file... the terminal returns something to do with process_item... How does one solve this?
sorry I'm new to this.
I learned the basics of Python and now I want to focus on something to get a job, is web scraping a skill that can get you a job on its own?
I watched it twice and I think it can be shortened quite a lot and better organized.
If you are wondering why 'process_request' does not work in part-8 make sure that you enabled 'downloader middlewares' in settings.py, instead of 'spider middlewares'...
Note that I copied the code from the tutorial page for the ScrapeOpsFakeUserAgentMiddleware, and when trying to run it I get the following error: (...) AttributeError: 'dict' object has no attribute 'to_string'.
SOLUTION: copy the process_request function exactly as it is in the video, not like in the tutorial page.
13:49 you created a folder named 'part-2' without showing us every single detail.. please show us everything exactly.
I think you forgot to remove the comma in parse_book_page which is why you needed to convert the tuples
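For anyone who hasn't spotted it, this is all the trailing comma does; a tiny standalone example (the field name is just for illustration):

```python
url = "https://books.toscrape.com/catalogue/some-book/index.html"

value = url,   # trailing comma -> ('https://...',)  a 1-tuple, not a string
value = url    # no comma       -> 'https://...'     plain string

# the same thing happens with  book_item["url"] = response.url,  in parse_book_page
```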
Looking forward to this. A mongodb/pymongo section would be nice for data storage though!
"I have a question, does all the change of agents and proxies once we implement them in our code also reflect in the Shell?"
This was very helpful, thank you so much for sharing all this knowledge free!
if anyone got an error during the sql part, remember to comment out the previous feed setting where we had selected csv format; that was causing the error for me.
I have an error with Pylance: it shows a warning when I'm importing bookscraper.items. I guess I did something wrong when creating the environment
I can't load data from scrapy into the sql tables like he did at 2:02:01, I get the column names, but the data is empty, and no errors. Anyone know why?
Exactly what I wanted at this moment, Thank you
In part 6, at the start of the process item function, despite having the exact same code as the tutorial my value = adapter.get(field_name) returns the exact value and not a tuple, so it was unnecessary to add the index in the following line, does anyone know why this is happening?
Can scrapy get data from Cloudflare-protected websites? I just want to extract a list of holidays from our country's government websites to automatically store them in a table, but they don't have an API for it.
Ok, so you schedule your spiders using scrapeops. But how do you consume the product of such scraping? As far as I know it's just being stored on the virtual server. Can you retrieve it with scrapeops?
Great course and thank you for your efforts! But in part 11, aren't you publishing your private scrapeops-api-keys to the public? Isn't that a little bit dangerous? Or to ask differently, what would be a good way to do this instead?
Overall a good video, I learned a lot of things, but I think you should briefly discuss css and xpath selectors. I am facing problems with them
For anyone having errors in Part 8 with the fake headers:
You need to import this:
from scrapy.http import Headers
and then in the process_requests function you need to replace this line:
request.headers = Headers(random_browser_header)
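Put together, the middleware ends up looking roughly like this (a sketch with a hard-coded header dict; in the course the headers come from the ScrapeOps browser-headers endpoint, and the class name may differ slightly from yours):

```python
import random

from scrapy.http import Headers


class ScrapeOpsFakeBrowserHeaderAgentMiddleware:
    def __init__(self):
        # stand-in list; in the course this is filled from the ScrapeOps API
        self.browser_headers_list = [
            {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
                "Accept-Language": "en-US,en;q=0.9",
            },
        ]

    def process_request(self, request, spider):
        random_browser_header = random.choice(self.browser_headers_list)
        # wrap the plain dict in scrapy.http.Headers so Scrapy gets the object it
        # expects (assigning a plain dict is what triggers the 'to_string' error)
        request.headers = Headers(random_browser_header)
```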
Thanks!
Thanks for this
I definitely recommend it to everyone 👌👌👌
How can I make my scrapy spider scrape a specific book on the website? If the user types a title name, they should get info only about that book
I don't understand the IPython shell activation part because we don't have a scrapy settings file
Thank you, thank you, and once again, thank you!
Please help, I keep getting Crawled 0 pages and the output files are always empty
Great content mate really appreciate it!
Thanks for such a wonderful web scraping tutorial. Please make a video tutorial on how to download thousands of pdfs from a website and perform pdf scraping with scrapy. In general, please make a tutorial on pdf scraping as well.
You don't do the pdf scraping with scrapy; it's not designed for parsing pdfs. You can download the pdfs using scrapy (at least I imagine you can), but you have to use a pdf scraper module in order to parse the contents of the pdf
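For what it's worth, a rough sketch of that split (Scrapy downloads the pdf bytes, a separate library such as pypdf parses them; the URL and field names here are made up):

```python
import io

import scrapy
from pypdf import PdfReader  # pdf parsing is done by pypdf, not by Scrapy


class PdfSpider(scrapy.Spider):
    name = "pdfspider"
    start_urls = ["https://example.com/reports/"]  # hypothetical listing page

    def parse(self, response):
        # follow every link that points at a pdf file
        for href in response.css("a::attr(href)").getall():
            if href.endswith(".pdf"):
                yield response.follow(href, callback=self.parse_pdf)

    def parse_pdf(self, response):
        # Scrapy only fetched the raw bytes; pypdf extracts the text from them
        reader = PdfReader(io.BytesIO(response.body))
        text = "".join(page.extract_text() or "" for page in reader.pages)
        yield {"url": response.url, "num_pages": len(reader.pages), "text": text}
```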
hard to follow if you are on a windows machine. 15 mins in and I am already lost. There's no bin folder?
same with me
on windows you would need to activate the virtual environment by venvName\Scripts\activate where venvName is the name of the virtual environment you created.
@@cotsrock9914 I know, it's always frustrating when you can't even setup the environment(no pun intended) before getting to the code. On windows, you would need to use Scripts\activate instead. Let me know if I can help, hours/days of frustration that I've had, I can totally understand ;)
Use below command to activate:
.\Scripts\activate
Can you please explain why we take [1] in li[1] @ 1:05:28?
best course ever
thanks for the tutorial, I have a question: which is the better choice for scraping websites, python or node?
python
I am using nodejs it's much faster ^^ @@jonwinder1861
Hey 👋🏻 when I use the scrapy shell and the view(response) command I cannot see all of the html from the website. I just see the "cookies accept" window; I can accept it in the browser, but after that I have a blank page. What can I do to fetch the whole html code?
If anyone is running into a programming error about a set while writing data to psql, simply change the line to adapter[price_key] = float(value.replace('£', '')) in process_item. I suppose this would work for the previous issues too.
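For context, that line sits in a loop like this inside process_item (a sketch; the exact price field names depend on your items.py):

```python
from itemadapter import ItemAdapter


class BookscraperPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # strip the £ sign and store prices as floats so the database driver
        # doesn't choke on the raw strings
        price_keys = ["price", "price_excl_tax", "price_incl_tax", "tax"]
        for price_key in price_keys:
            value = adapter.get(price_key)
            if value:
                adapter[price_key] = float(value.replace("£", ""))

        return item
```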
33:15 what should I do if fetch(url) returns ['partial']? I think it is not giving me all the html elements; is there any way to handle this?
I don't get the output of all the urls on 53:07 (only 20 items)
Amazing tutorial, I've really enjoyed watching and it helped me a lot with my project.
Great tutorial, thanks a lot!
I have a question with part4, in part 4 at first you just scraped one page but later on when we want to have all the next pages and modified it, it still shows me the first page, I'm not sure what is the reason. can you help me with that please?
Thank you
I face the same, did you find a solution?
Thanks a lot freeCodeCamp for another amazing tutorial ❤️.
In part 2 where you typed the activate command with bin in the path, this is not correct for windows installations. It says issue the command .\activate.bat and this worked for me
Hi, I'm trying this on VS studio, and in part 4 after running scrapy crawl bookspider, I'm not yielding any results, I even tried going to the guide and copying the exact code but it's still not yielding any results, anyone know what the issue with this is?
(1:18) I followed all the instructions but my output includes only title, price and link.
same, you solved?
Thank you very much for the good work! Really appreciate the tutorial.
I need to point out that the MySQL I installed with the dmg somehow cannot be used from the terminal, so I ended up reinstalling MySQL using the terminal.
in part 4 I have followed your code word for word, but on my side, instead of getting a 1000 item scraped count I am only getting 20. help pls
same'
i fixed it, save your file then run it
i want to know how to learn python scrapy
omg, soo complicated, but I'll sit through it!
Is anyone else getting the Line 21 error "NoneType object is not subscriptable" even after fixing the code? I can't seem to get around it. Not even deleting the upc line in both the bookspider and items. I don't really know what to do lol
i've had this problem too, in my case the problem was that in the spider at book_item["price"] I had the following book_item["price"] = response.css("p.price-color ::text").get() AND the correct way was book_item["price"] = response.css("p.price_color ::text").get(), because the price would not return anything
i too
after solving the bug... I'm also getting the same error as you...
When you save to the database at 2:02:00, I had the error because the url was a tuple and 'cannot be converted'. If someone has a similar problem, you can just index into the value like this: str(item["description"][0]) (instead of the code provided, which is str(item["description"])) in the execute call in the process_item function.
I’m still having the errors bro
@@ibranhr I found the error by looking at what was being processed when the error happened. I saw that it was a tuple and fixed it. Try something similar if you know the error is with converting values.
is there a way to extract the data in a table when the rows don't always correspond to the same fields? do we have to make some sort of mapping table?
54:36 - 18/05
1:23:16 - 26/05
1:44:19 - 14/06
that's what i need! 👍👍👍
Is this course enough to do scraping tasks on freelancing websites?
If it's not, could anyone mention what I should do after I finish this?
Thank you very much for this great course. I really learned a lot.
❤❤❤
god bless the internet and freecodecamp! thanks !
Can we scrape dynamic javascript webpages with Scrapy?
yes
just in time, thnx tho
I didn't know what I would do for a project I'm working on till I watched the video
life saver