Keyboard too loud? I've been using my mech kb again.. Is it too distracting?
I think it's fine, at least I didn't get distracted
@Vishal Gupta that website uses JavaScript to load the content.
But first try the library explained in this video by John. It looks like you can get the work done through it.
(I haven't used it myself so can't vouch for it.) Anyhow, if this library fails, you can definitely use Selenium and get your work done. Selenium opens the page in one of its browsers and loads it there, which loads all of the page contents and even gives you the option of clicking a particular web element.
A tip: just load the page with the Selenium library, then pass the source code of that page into bs4, also known as BeautifulSoup, and scrape the site the normal way from there on. It's worthwhile because Selenium's methods for extracting information from a website take a lot of time, whereas bs4 is much faster and has better error handling.
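The handoff described in that tip could look something like this (untested sketch; the URL and the CSS class are placeholders, not from the video):

```python
# Render with Selenium once, then hand the page source to BeautifulSoup
# for the actual parsing, which is much faster per lookup.
from bs4 import BeautifulSoup

def parse_products(html: str):
    """Pull product names out of rendered HTML (selector is a placeholder)."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("h4.product-name")]

if __name__ == "__main__":
    from selenium import webdriver
    driver = webdriver.Chrome()
    driver.get("https://example.com/products")  # placeholder URL
    html = driver.page_source  # full post-JavaScript markup
    driver.quit()
    print(parse_products(html))
```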
Not at all, makes it feel like you are working away!
I enjoy the sound. It's like in hackers in the movies :)
kind of enjoyed it !
i'm going through ALL of your videos and just finished this one! learning so much it's incredible!
THANK YOU for this video and all the others. I am learning web scraping to gather data for my PhD thesis and you have helped me make such great progress in just a few days. :)
Amazing explanation skills! Everything was clear. One of the greatest video for web scraping so far! Good job, Good luck!!
Man this is some amazing content. So glad i found your channel! Definitely earned a subscribe.
Thanks!
Lifesaver! Thank you so much! Wish you the best of luck with your channel!
Great video John as always - Thanks!
Thank you!
Thanks, again super easy to follow!
Thank you very much! Appreciate it.
I can get data from static websites using scrapy with relative ease, but I always come unstuck when I try the same with dynamic websites; I might give requests-html a go instead of my usual scrapy-selenium combo... Thanks for the video! 👊👊👊
Glad you liked it - give it a go. I believe scrapy-splash is an add on for scrapy that can reload dynamic pages but I’m yet to try it
This was super useful! I have a project right now that needs to scrape many pages that need rendering. This looks much more lightweight than what I'm currently using (Selenium)
When I use XPath on products (on a different site, but same principles) the terminal keeps returning 'None'. The site is GWT-based; would that affect XPath working?
You are a great and creative person...keep going champ.
Hi, I tried your code on another website, but when I arrive at the print(products) part, it returns a 'NoneType' object. The code gets no URL. What should I do? I tried using a user-agent, but that also returned nothing
Hello John,
if I add the command r.html.render(sleep=1), the output is "Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead." I've searched all over Google, no clue. Any idea?
Hiya! Are you running it in a jupyter notebook or similar? The way they work conflicts with the render function - try running it in vs code or similar and that should work
@@JohnWatsonRooney it's running in vs code, but I got a new error
python .\coba.py
Traceback (most recent call last):
File ".\coba.py", line 19, in
print(r.html.xpath("//div[@class='span6']/h1", first=True).text)
AttributeError: 'NoneType' object has no attribute 'text'. Can you tell me where I went wrong?
Awesome! I was searching for this type of scraping, and I found it here.
You are truly a life saver. Great, great video. Thanks mate
Hi John and everyone, I'm having trouble with the html.render() method, I'd appreciate any help.
The first time the method runs, it downloads Chromium. After I ran it, 3 red lines were printed (Downloading Chromium & stuff I can't remember). I felt like it took too long (more than 10 minutes), so I stopped the program.
Now when I try to run the method, the script just gets stuck. I mean, it is running, but it never continues to the lines after the html.render call. No errors are raised; the script simply never finishes.
I tried to pip uninstall requests-html and reinstall it, but I'm getting the same uninformative result.
How can I troubleshoot this problem? I'm excited to work with requests-HTML and to let go of Selenium for standard rendering needs, but I can't.
Thanks a lot to anyone who cares enough to give it a try.
John: when I follow your code, at "for item in products.absolute_links:", although I specify e.g. 'div.product-subtext', the iteration only returns item.text (the link text of the item) and not the sub-text of the item. This is true of price, name, and so forth. Can you explain this behavior?
What's the difference between using requests-html vs. scrapy or selenium?
Selenium is a tool built for a different purpose (browser testing); its by-product is EXCELLENT ease in web scraping. It's been 2 yrs though
Very clearly explained. May I ask if there is a GitHub repo containing the code that you used in the video?
John, I've done some code web-scraping dynamically like you in this video. But it's taking too much time because for every product it has to open its page. Is it common, is there a faster way for doing this?
Trying to recreate this on a similar e-commerce website, and print(products) from 4:57 gives NoneType. Any suggestions why?
Just print the html source code and check whether what you're looking for is actually in there
Hey John, awesome video (like always). I have a question: in terms of speed, would you recommend Splash or requests_html?
I haven’t done any proper speed tests but they do essentially the same thing so I think it would be marginal. Requests-html has the benefit of being a python package so if that works for your needs I’d use that. Splash has the benefits of scripting though- video to come!
@@JohnWatsonRooney Thanks, this helps so much.
Nice! - is there a way of doing this for the _currently displayed page_ ? - on a YT video page I want to scrape all the recommended videos and their titles from that page . .
Hi, I'm trying to retrieve data (the list of employers/vacancies loaded by jQuery code) from the Canadian job bank. I made a "get" request but wasn't able to get the inner response payload data from www.jobbank.gc.ca/jobsearch/jobsearch?searchstring=&locationstring=&sort=M. I can see this payload in the Firefox developer tools but failed to find the proper python library and methods to get it. Is there any way other than selenium to accomplish this task? I am at the very beginning of the path of learning programming and would be grateful for any help or advice on what to read or watch to figure it out. Thanks.
I think you might need to use the same approach as my sports stats video - using postman to replicate the request made by your browser, then copy that over to your python code
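A hedged sketch of that approach: instead of rendering the page, call the background endpoint the browser hits (find it in the DevTools Network tab; the params, headers, and JSON key names below are assumptions, so copy the real ones from your browser or Postman):

```python
def extract_titles(payload: dict) -> list:
    """Pull job titles out of a JSON payload (key names are placeholders)."""
    return [job["title"] for job in payload.get("results", [])]

if __name__ == "__main__":
    import requests
    r = requests.get(
        "https://www.jobbank.gc.ca/jobsearch/jobsearch",  # endpoint seen in DevTools
        params={"searchstring": "", "locationstring": "", "sort": "M"},
        headers={"User-Agent": "Mozilla/5.0"},  # mimic the browser's request
    )
    print(extract_titles(r.json()))
```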
@@JohnWatsonRooney ok, ill try this out. thanks anyway)
Hey John, great videos. Thank you so much for them! I wanted to ask, how can I scrape multiple categories (categories like /computers/, /headphones/, /monitors/, /keyboards/)? Do you have any video or idea for that?
Thanks for your content!
Hi bro, Did you find any technique to scrape multiple categories? Please let me know.
bravo sir, you gave me my eureka moment 👏
Thank you so much. Your video is going to help me a lot in a project that I'm about to start. One question if you don't mind: when I want to gather text, but part of the text is hidden behind a [click for more] hyperlink, that prevents the text from being fully copied to the csv file. Do you have a hint or suggestions? I appreciate your help in advance
hmm the website I am trying to scrape returns status code 429... and I haven't even started scraping. Do you know what could be causing it?
Hi John, trying to run the code I got this error with render: AttributeError: 'Future' object has no attribute 'html'. Any help please? Didn't find anything on Google. Thanks
You missed an explanation: under what circumstances should you use xpath vs. CSS selectors like div.class?
very helpful tutorial , thank you for your efforts
Can a modified version of this work on scraping links listed inside a live chat feed?
That’s not something I’ve tried but yes I think so
Nice video, but in your opinion which python web scraper uses the fewest resources, like memory?
If the website is html (no JavaScript) requests and bs4 will be the lightest in my opinion. The method in this video is slower due to the render process but still good for smaller projects - selenium is the slowest and not really designed for scraping but does work when needed
@@JohnWatsonRooney you should absolutely explain this in a single video, from the fastest method to the slowest one, thx sir
How can we use threading while scraping thousands of website links?
I've followed your code to a tee. It locks up in both PyCharm and VS Code at the render statement (r.html.render(sleep=1)). I literally have to close both programs to get them to run again. Any ideas? Great video though.
If it’s the first time running the render method it should download headless chrome - I’m guessing it’s getting stuck there. Maybe try removing requests_html and reinstalling it
Can't install requests-html, any ideas? I'm using windows and the error appears with lxml; I tried to install lxml separately and got the same error
Hey John, very helpful video, but I keep having this one issue when I try to render the url, I get this error message: RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.
Were you able to fix it? I'm having the same problem
@@alokyathiraj Im having a similar issue as well
I can't get this to work either. I think maybe the library needs to be updated.
Hi, first thanks a lot for your tutorial. I have a question: I generate my csv file, but my separator is ','. How can I change the separator?
Sure - after the csv file name, add sep="..." with whatever separator you want to use
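A tiny example of that sep argument (pandas), writing to a buffer so the result is easy to see; in real use you'd pass a file path instead:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Cider"], "price": ["3.99"]})

buf = io.StringIO()
df.to_csv(buf, sep=";", index=False)  # ';' instead of the default ','
print(buf.getvalue())
```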
Fantastic demonstration. Would love to know how we can use this module to submit forms or logins
Sure that’s a good idea , I will look into it
@@JohnWatsonRooney Looking forward to it... easing login efforts on flash-enabled sites such as gmail or any other. Any references now would be much help for my project!
Hi John, is it possible to parse the requests-html response with bs4? I've tried passing response.text when making a bs4 soup but it returns None.
Can somebody help me?
Hi, yes it is - I'm sure I've covered that before. It's quite a useful method. Try printing the html before making the soup and check it is what you were expecting to see
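A sketch of that combination (hedged; the URL is a placeholder): render first, then feed the rendered markup, r.html.html, to BeautifulSoup rather than the raw response.text:

```python
from bs4 import BeautifulSoup

def first_text(html: str, selector: str):
    """Return the text of the first match, or None - this makes it obvious
    when the selector simply didn't match anything."""
    tag = BeautifulSoup(html, "html.parser").select_one(selector)
    return tag.get_text(strip=True) if tag else None

if __name__ == "__main__":
    from requests_html import HTMLSession
    s = HTMLSession()
    r = s.get("https://example.com")      # placeholder URL
    r.html.render(sleep=1)                # run the JavaScript first
    print(first_text(r.html.html, "h1"))  # parse the *rendered* source
```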
I'm trying a website where the data only shows after typing in an input field; otherwise the html element is empty. What should I use? I'm using pyautogui to fill the field, but I don't know how to read the data back
great as always, thanks!
Hi, thank you so much for your video. I want to ask how to scrape multiple review pages for one product? I'm getting confused
Can You explain when should we use what??
I generally prefer sticking to selenium for all my needs.
Can you login a website using requests-html?
You can yes, you can post to the server - I have an older video on my channel where I cover the basics of this if you are interested
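A hedged sketch of that post-to-the-server idea (the URL and form field names are placeholders; inspect the site's login form for the real ones):

```python
def build_login_payload(user: str, password: str) -> dict:
    # Field names are assumptions - match them to the form's
    # <input name="..."> values exactly.
    return {"username": user, "password": password}

if __name__ == "__main__":
    from requests_html import HTMLSession
    s = HTMLSession()  # the session keeps cookies between calls
    s.post("https://example.com/login",  # placeholder URL
           data=build_login_payload("me", "secret"))
    r = s.get("https://example.com/account")  # fetched as the logged-in user
    print(r.status_code)
```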
After using render I got this error: There is no current event loop in thread 'Thread-5 (process_request_thread)'
Hey John, after struggling with stackoverflow I am here finally... "response.html.render(sleep=3)" is giving an error in a django view (i.e. There is no current event loop in thread 'uWSGIWorker1Core8'). Can you help me solve this?
While trying to get product links on a category page of the site I work with, it also picks up an extra 2 links I don't want for each product. How can I remove the links I don't want? Or, since there's one word that only exists in the links I want, how can I get only the links containing that word?
Hello, I'm having a Chromium-related error when I try to render an html page. Can you please tell me how I can fix it?
I tried that in jupyter and it gave me this error message: **'Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.'**
Sir, how would you deal with infinite scrolling if you can't find an easy way?
That's a bit more tricky. Without browser automation (selenium) we can use "r.html.render(sleep=1, scrolldown=x)" - where x is the amount of times to page down. Not ideal but might work
what can i do if the xpath search doesn't find anything?
You are the best, subscribed
Thank you sir. This make sense to me
Great video and easy to follow for a noob like me! Appreciate it :D
:D thank you
@@JohnWatsonRooney Do you have any videos focusing on if statements and/or keyword lists such as changing results, for example;
Junior = Entry Level
Early Professional = Entry Level
Graduate = Entry Level
etc...
Nice video - minus the try/catch with no specific exception. I know this is a tutorial, but that’s a bad habit to share. Regardless, thank you for the content.
Thanks, and yes you are absolutely right, I don’t do that anymore!
thanks for learning
Hi John, I am one of your fans. I really wonder, how did you learn these techniques? I'm currently in a state where I don't know how to become a self-taught web scraper. In other words, I don't know how to learn from the myriad of knowledge on the internet. But fortunately, I found you
Thank you man really useful !!
what is better bs4 or html.xpath ???
learn to use both but generally if i can i use BS4
With requests_html, when I print the soup, I get the message "you are not authorized..." in the page html. I tried loading the page manually and it worked, so my IP isn't blocked. Can anyone help me with this?
OMG! I would like to hit the "like" button a million times!
Thank you very much!
Hi John, I've watched this video many times, you're great at explaining. However, I am getting the error "Navigation Timeout Exceeded: 8000 ms exceeded" on r.html.render(sleep=1). I even bumped up the sleep time. Please help.
Try using timeout=(a number larger than 8) instead of sleep. Worked for me
@@pranitganvir449 thanks for the advice, it worked with timeout=30 and also added keep_page=True
Hi , can you help me ??
it doesnt work with aliexpress
Hope you don't mind me asking, but I have been banging my head against this one for a few hours... I am trying to pick up only a specific URL from a container (the container has non-product URLs):

from requests_html import HTMLSession
import pandas as pd
import time

url = 'www.fragrancenet.com/fragrances'
s = HTMLSession()
r = s.get(url)
r.html.render(sleep=1)
products = r.html.xpath('//*[@id="resultSet"]', first=True)
print(products.absolute_links)

I am only looking for the p-tags under Result set called:
Any help would be super appreciated, thanks again John.
Good one. Any idea how to do the same for Laravel-based sites??
Could you do a solid for me, I’ve suffered trying to scrape this site
Dude, update this code. When I try to run requests_html, all it states is that it needs Chromium to work, but the thing is I have Chromium on my machine, even the binary file. Why, when I run it, does it attempt to download Chromium, which I already have, and then fail to find it? I tried this a few months back and now I've returned to the same issue. I even uninstalled and installed everything again, but the same problem.
Using render for the first time, I haven't been able to install anything and it's giving me an error
Ohhhh nice I use an API that uses this method
can you do a scraping video for the tracton gyan website?
Great job, keep it up, keep it useful
oops...You are legend...........I am blind...This is also in docs on top layer 😂(I think I need some sleep)
Amazing sir please keep posting videos like this we will help u to increase subscriber number
you. are. awesome!
Thank you♥️♥️ you are BEST💪
you shouldn't be john rooney, you should be john legend
good explanation
What does first=True do?
With requests-html, "find" always returns a list, but using first=True forces it to return only a single item: the first element that matches your find criteria
@@JohnWatsonRooney got it, thanks. On to pt2!
Great video, I didn't know about this option; I normally used bs4
hi sir, can you fix this problem :
AttributeError: 'NoneType' object has no attribute 'text'
Thanks, btw nice vid
Hello, can you do scraping on this page : stats.nba.com/teams/transition/
I want to compare playtype team1 percentile on offense (also the frequency) against team2 percentile on defense. can you help me, please?
Hi! Yes I can scrape that site - I have a video coming this week that scrapes a site simliar that you will be able to apply to this site too. JR
@@JohnWatsonRooney Great! Thank you for the really quick answer!
nice video 👌 and keep going
Cider | 4.0% | 44 cl
Trying this: info=r.html.find('div.Select an element with a CSS Selector:',first=True).text
The output shows:AttributeError: 'NoneType' object has no attribute 'text'
Probably you chose your class incorrectly, and that's why you have no elements in your output. NoneType means you have no result (an empty match).
I used the same code and it didn't work for me. I changed the website to my desired one and I get a bunch of errors... :(
Awesome
Please make a video on downloading and loading Chromium with requests_html
Thank Bro
Beerwulf is not a dynamic site....LOL
🖤👌🏻
The accent, where are you from?
UK near London
When I type r.html.render() I get this: Unresolved attribute reference 'html' for class 'Response'
So, when copy the Xpath, I get this as a result:
/html/body/div[7]/div[4]/section/div[10]/div[3]/div[2]/div[2]/div[1]/ul[2]
Are you using chrome or Firefox? That looks like the “full xpath” option, as opposed to just the “xpath”. I am planning to do a video on xpaths to clear it up a bit more
@@JohnWatsonRooney the inspector in Firefox, which leads me to think, then, that there's a difference between Chrome and Firefox?
There shouldn’t be but I have seen different results from both
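For what it's worth, the difference the thread is circling can be seen with a tiny snippet (lxml, which requests-html uses under the hood; the markup here is made up). A "full" XPath encodes every position from the root and breaks as soon as the layout shifts, while an attribute-based one survives:

```python
from lxml import html

doc = html.fromstring(
    '<html><body><div class="span6"><h1>Cider</h1></div></body></html>')

full = doc.xpath("/html/body/div[1]/h1")       # brittle positional path
short = doc.xpath("//div[@class='span6']/h1")  # robust attribute-based path
print(full[0].text, short[0].text)
```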
🥰🥰🥰🥰
Great, but it took a lot of time rendering
Didn't you have any other site instead of the beer website?
Why are you promoting harmful things?
i have a problem with this code produk = r.html.xpath('/html/body/div[4]/div[2]/div[2]/div[2]/div[1]/div/div[2]',first=True)...the result is None or []..how to fix it?
I'm having the same problem, did you find a solution?
Thanks
r.html.render() is not working. What can I do?
Did you find a solution?