Keyboard too loud? I've been using my mech kb again.. Is it too distracting?
I think it's fine, at least I didn't get distracted
@Vishal Gupta that website is using JavaScript to load the content.
But first try using the library explained in this video by John. It looks like you can get the work done with it.
(I haven't used it myself so I can't vouch for it.) If this library fails, you can definitely use Selenium and get your work done. Selenium opens the page in one of its browsers and loads it there, which renders all of the page contents and even gives you the option of clicking on a particular web element.
A tip: just load the page with the Selenium library, then pass the source code of that page into bs4 (also known as BeautifulSoup) and scrape the site in the normal way from there on. This matters because Selenium's methods for extracting information from a website take a lot of time, whereas bs4 is much faster and has better error handling.
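A minimal sketch of that Selenium-to-bs4 handoff. The HTML string and the CSS classes below are made-up placeholders standing in for `driver.page_source`, so the parsing half runs without a browser:

```python
from bs4 import BeautifulSoup

# In a real script this string would come from Selenium, e.g.:
#   driver = webdriver.Chrome(); driver.get(url)
#   html = driver.page_source; driver.quit()
# A static snippet stands in for the rendered page here.
html = """
<div class="product"><h2>Pale Ale</h2><span class="price">2.50</span></div>
<div class="product"><h2>Stout</h2><span class="price">3.00</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
names = [h.text for h in soup.select("div.product h2")]
prices = [p.text for p in soup.select("div.product span.price")]
print(names, prices)
```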
Not at all, makes it feel like you are working away!
I enjoy the sound. It's like the hackers in the movies :)
kind of enjoyed it !
I'm going through ALL of your videos and just finished this one! Learning so much, it's incredible!
THANK YOU for this video and all the others. I am learning web scraping to gather data for my PhD thesis and you have helped me make such great progress in just a few days. :)
Amazing explanation skills! Everything was clear. One of the greatest video for web scraping so far! Good job, Good luck!!
Man this is some amazing content. So glad i found your channel! Definitely earned a subscribe.
Thanks!
Lifesaver! Thank you so much! Wish you the best of luck with your channel!
I can get data from static websites using Scrapy with relative ease, but I always come unstuck when I try the same with dynamic websites; I might give requests-html a go instead of my usual Scrapy-Selenium combo. Thanks for the video! 👊👊👊
Glad you liked it - give it a go. I believe scrapy-splash is an add-on for Scrapy that can render dynamic pages, but I'm yet to try it
This was super useful! I have a project right now that needs to scrape many pages that require rendering. This looks much more lightweight than what I'm using now (Selenium)
Great video John as always - Thanks!
Thank you!
You are a great and creative person...keep going champ.
Awesome! I was searching for exactly this type of scraping, and I found it.
You are truly a lifesaver. Great, great video, thanks mate
When I use XPath on products (on a different site, but same principles), the terminal keeps returning 'None'. The site is GWT-based; would that stop XPath from working?
Hi, I tried your code on another website, but when I get to the print(products) part, it returns a 'NoneType' object. The code gets no URL. What should I do? I tried using a user-agent, but that also returned nothing
Hello John,
If I add the command r.html.render(sleep=1), the output is "Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead." I've searched everywhere on Google with no clue. Any idea?
Hiya! Are you running it in a Jupyter notebook or similar? The way notebooks work conflicts with the render function - try running it in VS Code or similar and that should work
@@JohnWatsonRooney It's running in VS Code, but I got a new error:
python .\coba.py
Traceback (most recent call last):
File ".\coba.py", line 19, in
print(r.html.xpath("//div[@class='span6']/h1", first=True).text)
AttributeError: 'NoneType' object has no attribute 'text'. Can you tell me where I went wrong?
Very clearly explained. May I ask if there is a GitHub repo containing the code that you used in the video?
You missed an explanation: under what circumstances should you use XPath vs. a CSS selector like div.?
Nice video - minus the try/catch with no specific exception. I know this is a tutorial, but that’s a bad habit to share. Regardless, thank you for the content.
Thanks, and yes you are absolutely right, I don’t do that anymore!
John: when I follow your code at "for item in products.absolute_links:", although I specify e.g. 'div.product-subtext', the iteration only returns item.text (the link text of the item) and not the sub-text of the item. This is true of price, name, and so forth. Can you explain this behaviour?
Hi John and everyone, I'm having trouble with the html.render() method; I'd appreciate any help.
The first time the method runs, it downloads Chromium. After I ran it, three red lines were printed (Downloading Chromium & stuff I can't remember). I felt like it took too long (more than 10 minutes), so I stopped the program.
Now when I try to run the method, the script just gets stuck. I mean, it is running, but it never continues past the html.render line. No errors are raised; the script simply never finishes.
I tried to pip uninstall requests-html and reinstall it, but I'm getting the same unhelpful result.
How can I troubleshoot this problem? I'm excited to work with requests-HTML and let go of Selenium for standard rendering needs, but I can't.
Thanks a lot to anyone who cares enough to give it a try.
Hi John, I am one of your fans. I really wonder how you learned these techniques. I'm currently at the point where I don't know how to become a self-taught web scraper; in other words, I don't know how to learn from the myriad of knowledge on the internet. But fortunately, I found you
Hey John, awesome video (as always). I have a question: in terms of speed, would you recommend Splash or requests-html?
I haven’t done any proper speed tests but they do essentially the same thing so I think it would be marginal. Requests-html has the benefit of being a python package so if that works for your needs I’d use that. Splash has the benefits of scripting though- video to come!
@@JohnWatsonRooney Thanks, this helps so much.
OMG! I would like to hit the "like" button a million times!
Thank you very much!
bravo sir, you gave me my eureka moment 👏
Amazing video. I'm wondering how we can scrape all the pictures for a product if they are rendered dynamically (like in a slideshow)
Thank you so much. Your video is going to help me a lot in a project I'm about to start. One question if you don't mind: when I want to gather text but part of it is hidden behind a [click for more] hyperlink, that prevents the text from being fully copied to the CSV file. Do you have a hint or suggestion? I appreciate your help in advance
very helpful tutorial , thank you for your efforts
John, I've done some dynamic web scraping like you show in this video, but it's taking too much time because it has to open a page for every product. Is this common? Is there a faster way of doing it?
great as always, thanks!
Thanks, again super easy to follow!
Thank you very much! Appreciate it.
Can You explain when should we use what??
I generally prefer sticking to selenium for all my needs.
I tried that in Jupyter and it gave me this error message: **'Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.'**
You are the best, subscribed
Fantastic demonstration. Would love to know how we can use this module to submit forms or log in
Sure that’s a good idea , I will look into it
@@JohnWatsonRooney Looking forward to it. Easing login effort on Flash-enabled sites such as Gmail or any other - any references now would be very helpful for me in my project!
Nice! - is there a way of doing this for the _currently displayed page_? On a YT video page I want to scrape all the recommended videos and their titles from that page.
Hey John, after struggling on Stack Overflow I am finally here. "response.html.render(sleep=3)" is giving an error in a Django view ("There is no current event loop in thread 'uWSGIWorker1Core8'"). Can you help me solve this?
How can we use threading while scraping thousands of website links?
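One common approach is the standard library's thread pool. In this sketch, fetch is a stand-in so the example runs without a network; in a real scraper it would call something like session.get(url):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real request, e.g. session.get(url).text,
# so the sketch runs without a network connection.
def fetch(url):
    return f"scraped {url}"

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]

# pool.map preserves input order even though workers run concurrently
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))

print(results)
```

Keep max_workers modest when scraping thousands of links, both for politeness to the target site and to avoid getting rate-limited.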
What's the difference between using requestes-html vs. scrapy or selenium?
Selenium is a tool built for a different purpose; excellent ease of web scraping is a by-product. It's been 2 years though
Great video and easy to follow for a noob like me! Appreciate it :D
:D thank you
@@JohnWatsonRooney Do you have any videos focusing on if statements and/or keyword lists, such as changing results? For example:
Junior = Entry Level
Early Professional = Entry Level
Graduate = Entry Level
etc...
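That kind of keyword normalisation could be sketched with a plain dictionary lookup. The keywords and levels below come from the examples above; case-insensitive substring matching is an assumption about how the job titles look:

```python
# Keyword -> normalised level mapping, taken from the examples above;
# substring matching is an assumption about how the titles look.
LEVEL_MAP = {
    "junior": "Entry Level",
    "early professional": "Entry Level",
    "graduate": "Entry Level",
}

def normalise(title):
    lowered = title.lower()
    for keyword, level in LEVEL_MAP.items():
        if keyword in lowered:
            return level
    return "Other"  # fallback when no keyword matches

print(normalise("Graduate Software Engineer"))
print(normalise("Senior Architect"))
```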
Can a modified version of this work on scraping links listed inside a live chat feed?
That’s not something I’ve tried but yes I think so
Hi, thank you so much for your video. I want to ask how to scrape multiple review pages for one product? I'm confused
I'm trying a website where the data is shown only after typing into an input field; otherwise the HTML element is empty. What should I use? I'm using pyautogui to fill the field, but I don't know how to read the data back
hmm the website I am trying to scrape returns status code 429... and I haven't even started scraping. Do you know what could be causing it?
Hi John, trying to run the code I got this error with render: AttributeError: 'Future' object has no attribute 'html'. Any help please? I didn't find anything on Google. Thanks
Can't install requests-html, any ideas? I'm using Windows and the error comes from lxml; I tried installing lxml separately and got the same error
Thank you man really useful !!
I've followed your code to a tee. It locks up in both PyCharm and VS Code at the render statement (r.html.render(sleep=1)); I literally have to close both programs to get them running again. Any ideas? Great video though.
If it’s the first time running the render method it should download headless chrome - I’m guessing it’s getting stuck there. Maybe try removing requests_html and reinstalling it
After using render I got this error: There is no current event loop in thread 'Thread-5 (process_request_thread)'
Oops... you are a legend... I am blind. This is also in the docs, right at the top 😂 (I think I need some sleep)
Hi John, is it possible to parse the requests-html response with bs4? I've tried passing response.text when making a bs4 soup but it returns None. Can somebody help me?
Hi, yes it is - I’m sure I’ve covered that before. It’s quite a useful method. Try printing the html before making the soup and check is it what you were expecting to see
Hey John, great videos. Thank you so much! I wanted to ask: how can I scrape multiple categories (like /computers, /headphones, /monitors, /keyboards)? Do you have any video or idea for that?
Thanks for your content!
Hi bro, Did you find any technique to scrape multiple categories? Please let me know.
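One simple way to sketch the multi-category loop: build the category URLs up front and iterate over them. The base domain below is a placeholder; the paths come from the question above:

```python
# Placeholder base domain; the category paths come from the question above
BASE = "https://www.example-shop.com"
CATEGORIES = ["/computers", "/headphones", "/monitors", "/keyboards"]

urls = [BASE + path for path in CATEGORIES]
for url in urls:
    # In a real scraper: r = session.get(url); r.html.render(); parse products
    print(url)
```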
With requests_html, when I print the soup I get the message "you are not authorized..." in the page HTML. I tried loading the page manually and it worked, so my IP isn't blocked. Can anyone help me with this?
Hi, I'm trying to retrieve data (the list of employers/vacancies loaded by jQuery) from the Canadian Job Bank. I made a GET request but wasn't able to get the inner response payload from www.jobbank.gc.ca/jobsearch/jobsearch?searchstring=&locationstring=&sort=M. I can see the payload in the Firefox developer tools but couldn't find the right Python library and methods to get it. Is there any way other than Selenium to accomplish this task? I'm at the very beginning of learning programming and would be grateful for any help or advice on what to read or watch to figure it out. Thanks.
I think you might need to use the same approach as my sports stats video - use Postman to replicate the request made by your browser, then copy that over to your Python code
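A sketch of that idea with requests. The header values are placeholders - in practice you would copy the real ones from Postman or the browser's network tab - and the request is only prepared here, not actually sent:

```python
import requests

# Placeholder headers - copy the real ones from Postman or the
# browser's network tab; these values are not from the video.
headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
}
params = {"searchstring": "", "locationstring": "", "sort": "M"}

req = requests.Request(
    "GET",
    "https://www.jobbank.gc.ca/jobsearch/jobsearch",
    headers=headers,
    params=params,
)
prepared = req.prepare()  # build the request without sending it
print(prepared.url)
```

In a real script you would send it with `requests.Session().send(prepared)` (or just call `requests.get(url, headers=headers, params=params)` directly).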
@@JohnWatsonRooney OK, I'll try this out. Thanks anyway)
Thank you sir. This make sense to me
What can I do if the XPath search doesn't find anything?
Hi, first, thanks a lot for your tutorial. I have a question: I generate my CSV file, but my separator is ','. How can I change the separator?
Sure - after the CSV file name, add in sep="..." and put in whatever separator you want to use
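For example, with pandas (writing to an in-memory buffer here just to show the output; normally you'd pass a filename):

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Pale Ale", "Stout"], "price": [2.5, 3.0]})

buf = io.StringIO()  # in-memory stand-in for a real file path
df.to_csv(buf, sep=";", index=False)  # sep sets the separator character
print(buf.getvalue())
```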
Hey John, very helpful video, but I keep having this one issue when I try to render the url, I get this error message: RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.
Were you able to fix it? I'm having the same problem
@@alokyathiraj I'm having a similar issue as well
I can't get this to work either. I think maybe the library needs to be updated.
Nice video! But in your view, which Python web scraping approach uses the fewest resources (memory, etc.)?
If the website is html (no JavaScript) requests and bs4 will be the lightest in my opinion. The method in this video is slower due to the render process but still good for smaller projects - selenium is the slowest and not really designed for scraping but does work when needed
@@JohnWatsonRooney You should absolutely explain this in a single video, from the fastest method to the slowest one. Thanks sir
Great job keep it up keep useful
Thank you♥️♥️ you are BEST💪
While trying to get product links on a category page of the site I work with, I also pick up two extra links I don't want for each product. How can I remove the links I don't want? Or, since one word appears only in the links I want, how can I get just the links containing that word?
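One way to sketch that filter, assuming the wanted links all contain a marker word (the link set and the "/product/" marker below are made up):

```python
# Made-up link set standing in for products.absolute_links;
# "/product/" is an assumed marker word for the wanted links.
links = {
    "https://shop.example.com/product/pale-ale",
    "https://shop.example.com/product/stout",
    "https://shop.example.com/reviews/pale-ale",
    "https://shop.example.com/cart",
}

wanted = sorted(link for link in links if "/product/" in link)
print(wanted)
```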
Hello, I'm having a Chromium-related error when I try to render an HTML page. Can you please tell me how I can fix it?
thanks for learning
Sir, how would you deal with infinite scrolling if you can't find an easy way?
That's a bit more tricky. Without browser automation (Selenium) we can use "r.html.render(sleep=1, scrolldown=x)", where x is the number of times to page down. Not ideal, but it might work
Trying to recreate this on a similar e-commerce website, and print(products) from 4:57 gives NoneType. Any suggestions why?
Just print the HTML source code and check whether what you're looking for is actually there
Using render for the first time, I haven't been able to install anything and it's giving me an error
Amazing sir please keep posting videos like this we will help u to increase subscriber number
Good one! Any idea how to do the same for Laravel-based sites?
Can you login a website using requests-html?
You can, yes - you can POST to the server. I have an older video on my channel where I cover the basics of this if you're interested
you. are. awesome!
Could you do me a solid? I've suffered trying to scrape this site
Dude, update this code. I try to run requests_html and all it states is that it needs Chromium to work, but the thing is I have Chromium on my machine, even the binary file. Why, when I run it, does it attempt to download Chromium (which I already have) and then fail to find it? I tried this a few months back and now I've returned to the same issue. I even uninstalled and reinstalled everything, but same problem.
nice video 👌 and keep going
good explanation
Hope you don't mind me asking, but I have been banging my head against this one for a few hours. I am trying to pick up only a specific URL from a container (the container has non-product URLs):
"
from requests_html import HTMLSession

url = 'https://www.fragrancenet.com/fragrances'

s = HTMLSession()
r = s.get(url)
r.html.render(sleep=1)

products = r.html.xpath('//*[@id="resultSet"]', first=True)
print(products.absolute_links)
"
I am only looking for the p-tags under Result set called:
Any help would be super appreciated, thanks again John.
Hi John, I've watched this video many times; you're great at explaining. However, I'm getting the error "Navigation Timeout Exceeded: 8000 ms exceeded" from r.html.render(sleep=1). I even bumped up the sleep time. Please help.
Try using timeout=(a number larger than 8) instead of sleep. Worked for me
@@pranitganvir449 Thanks for the advice - it worked with timeout=30, and I also added keep_page=True
Hi , can you help me ??
Ohhhh nice I use an API that uses this method
It doesn't work with AliExpress
can you do scrapping video for tracton gyan website?
Great video, I didn't know about this option - I normally used bs4
you shouldn't be john rooney, you should be john legend
what is better bs4 or html.xpath ???
Learn to use both, but generally if I can, I use bs4
hi sir, can you fix this problem :
AttributeError: 'NoneType' object has no attribute 'text'
Thanks, btw nice vid
Awesome
I used the same code and it didn't work for me; I changed the website to my desired one and I get a bunch of errors... :(
Thank Bro
Cider | 4.0% | 44 cl
Trying this: info = r.html.find('div.Select an element with a CSS Selector:', first=True).text
The output shows: AttributeError: 'NoneType' object has no attribute 'text'
You probably chose your class incorrectly, which is why there are no elements in your output. NoneType means you got no result (an empty match).
What does first=True do?
With requests-html, find always returns a list, but using first=True makes it return only a single item - the first element that matches your find criteria
@@JohnWatsonRooney got it, thanks. On to pt2!
🖤👌🏻
Hello, can you do scraping on this page : stats.nba.com/teams/transition/
I want to compare playtype team1 percentile on offense (also the frequency) against team2 percentile on defense. can you help me, please?
Hi! Yes I can scrape that site - I have a video coming this week that scrapes a similar site, and you will be able to apply it to this one too. JR
@@JohnWatsonRooney Great! Thank you for the really quick answer!
So, when I copy the XPath, I get this as a result:
/html/body/div[7]/div[4]/section/div[10]/div[3]/div[2]/div[2]/div[1]/ul[2]
Are you using Chrome or Firefox? That looks like the "full XPath" option, as opposed to just the "XPath". I am planning to do a video on XPaths to clear it up a bit more
@@JohnWatsonRooney The inspector in Firefox - which leads me to think, then, that there's a difference between Chrome and Firefox?
There shouldn’t be but I have seen different results from both
When I try to type r.html.render() I get this: Unresolved attribute reference 'html' for class 'Response'
Beerwulf is not a dynamic site....LOL
The accent, where are you from?
UK near London
Great, but it took a lot of time to render
🥰🥰🥰🥰
Couldn't you have picked any other site instead of a beer website?
Why are you promoting harmful things?
I have a problem with this code: produk = r.html.xpath('/html/body/div[4]/div[2]/div[2]/div[2]/div[1]/div/div[2]', first=True). The result is None or []. How do I fix it?
I'm having the same problem; did you find a solution?
Thanks
r.html.render() is not working. What can I do?
Did you find a void?