Keyboard too loud? I've been using my mech kb again.. Is it too distracting?
I think it's fine, at least I didn't get distracted
@Vishal Gupta that website is using JavaScript to load the content.
But first try using the library explained in this video by John. It looks like you can get the work done with it.
(I haven't used it myself so I can't vouch for it.) If this library fails, you can definitely use Selenium and get your work done. Selenium opens the page in one of its browsers and loads it there, which renders all of the page contents and even gives you the option of clicking on a particular web element.
A tip: just load the page with the Selenium library, then pass the source code of that page into bs4 (also known as BeautifulSoup) and scrape the site in the normal way from there on. This matters because Selenium's methods for extracting information from a website take a lot of time, whereas bs4 is much faster and has better error handling.
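A minimal sketch of that Selenium-to-bs4 handoff. The HTML string and the CSS classes below are made-up placeholders standing in for `driver.page_source`, so the parsing half runs without a browser:

```python
from bs4 import BeautifulSoup

# In a real script this string would come from Selenium, e.g.:
#   driver = webdriver.Chrome(); driver.get(url)
#   html = driver.page_source; driver.quit()
# A static snippet stands in for the rendered page here.
html = """
<div class="product"><h2>Pale Ale</h2><span class="price">2.50</span></div>
<div class="product"><h2>Stout</h2><span class="price">3.00</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
names = [h.text for h in soup.select("div.product h2")]
prices = [p.text for p in soup.select("div.product span.price")]
print(names, prices)
```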
Not at all, makes it feel like you are working away!
I enjoy the sound. It's like the hackers in the movies :)
kind of enjoyed it !
I'm going through ALL of your videos and just finished this one! Learning so much, it's incredible!
THANK YOU for this video and all the others. I am learning web scraping to gather data for my PhD thesis and you have helped me make such great progress in just a few days. :)
Amazing explanation skills! Everything was clear. One of the greatest video for web scraping so far! Good job, Good luck!!
Man this is some amazing content. So glad i found your channel! Definitely earned a subscribe.
Thanks!
Lifesaver! Thank you so much! Wish you the best of luck with your channel!
I can get data from static websites using Scrapy with relative ease, but I always come unstuck when I try the same with dynamic websites; I might give requests-html a go instead of my usual Scrapy-Selenium combo. Thanks for the video! 👊👊👊
Glad you liked it - give it a go. I believe scrapy-splash is an add-on for Scrapy that can render dynamic pages, but I'm yet to try it
This was super useful! I have a project right now that needs to scrape many pages that require rendering. This looks much more lightweight than what I'm using now (Selenium)
Great video John as always - Thanks!
Thank you!
You are a great and creative person...keep going champ.
Awesome! I was searching for exactly this type of scraping, and I found it.
You are truly a lifesaver. Great, great video, thanks mate
When I use XPath on products (on a different site, but same principles), the terminal keeps returning 'None'. The site is GWT-based; would that stop XPath from working?
Hi, I tried your code on another website, but when I get to the print(products) part, it returns a 'NoneType' object. The code gets no URL. What should I do? I tried using a user-agent, but that also returned nothing
Hello John,
If I add the command r.html.render(sleep=1), the output is "Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead." I've searched everywhere on Google with no clue. Any idea?
Hiya! Are you running it in a Jupyter notebook or similar? The way notebooks work conflicts with the render function - try running it in VS Code or similar and that should work
@@JohnWatsonRooney It's running in VS Code, but I got a new error:
python .\coba.py
Traceback (most recent call last):
File ".\coba.py", line 19, in
print(r.html.xpath("//div[@class='span6']/h1", first=True).text)
AttributeError: 'NoneType' object has no attribute 'text'. Can you tell me where I went wrong?
Very clearly explained. May I ask if there is a GitHub repo containing the code that you used in the video?
You missed an explanation: under what circumstances should you use XPath vs. a CSS selector like div.?
Nice video - minus the try/catch with no specific exception. I know this is a tutorial, but that’s a bad habit to share. Regardless, thank you for the content.
Thanks, and yes you are absolutely right, I don’t do that anymore!
John: when I follow your code at "for item in products.absolute_links:", although I specify e.g. 'div.product-subtext', the iteration only returns item.text (the link text of the item) and not the sub-text of the item. This is true of price, name, and so forth. Can you explain this behaviour?
Hi John and everyone, I'm having trouble with the html.render() method; I'd appreciate any help.
The first time the method runs, it downloads Chromium. After I ran it, three red lines were printed (Downloading Chromium & stuff I can't remember). I felt like it took too long (more than 10 minutes), so I stopped the program.
Now when I try to run the method, the script just gets stuck. I mean, it is running, but it never continues past the html.render line. No errors are raised; the script simply never finishes.
I tried to pip uninstall requests-html and reinstall it, but I'm getting the same unhelpful result.
How can I troubleshoot this problem? I'm excited to work with requests-HTML and let go of Selenium for standard rendering needs, but I can't.
Thanks a lot to anyone who cares enough to give it a try.
Hi John, I am one of your fans. I really wonder how you learned these techniques. I'm currently at the point where I don't know how to become a self-taught web scraper; in other words, I don't know how to learn from the myriad of knowledge on the internet. But fortunately, I found you
Hey John, awesome video (as always). I have a question: in terms of speed, would you recommend Splash or requests-html?
I haven’t done any proper speed tests but they do essentially the same thing so I think it would be marginal. Requests-html has the benefit of being a python package so if that works for your needs I’d use that. Splash has the benefits of scripting though- video to come!
@@JohnWatsonRooney Thanks, this helps so much.
OMG! I would like to hit the "like" button a million times!
Thank you very much!
bravo sir, you gave me my eureka moment 👏
Amazing video. I'm wondering how we can scrape all the pictures for a product if they are rendered dynamically (like in a slideshow)
Thank you so much. Your video is going to help me a lot in a project I'm about to start. One question if you don't mind: when I want to gather text but part of it is hidden behind a [click for more] hyperlink, that prevents the text from being fully copied to the CSV file. Do you have a hint or suggestion? I appreciate your help in advance
very helpful tutorial , thank you for your efforts
John, I've done some dynamic web scraping like you show in this video, but it's taking too much time because it has to open a page for every product. Is this common? Is there a faster way of doing it?
great as always, thanks!
Thanks, again super easy to follow!
Thank you very much! Appreciate it.
Can You explain when should we use what??
I generally prefer sticking to selenium for all my needs.
I tried that in Jupyter and it gave me this error message: **'Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.'**
You are the best, subscribed
Fantastic demonstration. Would love to know how we can use this module to submit forms or log in
Sure that’s a good idea , I will look into it
@@JohnWatsonRooney Looking forward to it. Easing login effort on Flash-enabled sites such as Gmail or any other - any references now would be very helpful for me in my project!
Nice! - is there a way of doing this for the _currently displayed page_? On a YT video page I want to scrape all the recommended videos and their titles from that page.
Hey John, after struggling on Stack Overflow I am finally here. "response.html.render(sleep=3)" is giving an error in a Django view ("There is no current event loop in thread 'uWSGIWorker1Core8'"). Can you help me solve this?
How can we use threading while scraping thousands of website links?
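One common approach is the standard library's thread pool. In this sketch, fetch is a stand-in so the example runs without a network; in a real scraper it would call something like session.get(url):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real request, e.g. session.get(url).text,
# so the sketch runs without a network connection.
def fetch(url):
    return f"scraped {url}"

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]

# pool.map preserves input order even though workers run concurrently
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))

print(results)
```

Keep max_workers modest when scraping thousands of links, both for politeness to the target site and to avoid getting rate-limited.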
What's the difference between using requestes-html vs. scrapy or selenium?
Selenium is a tool built for a different purpose; excellent ease of web scraping is a by-product. It's been 2 years though
Great video and easy to follow for a noob like me! Appreciate it :D
:D thank you
@@JohnWatsonRooney Do you have any videos focusing on if statements and/or keyword lists, such as changing results? For example:
Junior = Entry Level
Early Professional = Entry Level
Graduate = Entry Level
etc...
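That kind of keyword normalisation could be sketched with a plain dictionary lookup. The keywords and levels below come from the examples above; case-insensitive substring matching is an assumption about how the job titles look:

```python
# Keyword -> normalised level mapping, taken from the examples above;
# substring matching is an assumption about how the titles look.
LEVEL_MAP = {
    "junior": "Entry Level",
    "early professional": "Entry Level",
    "graduate": "Entry Level",
}

def normalise(title):
    lowered = title.lower()
    for keyword, level in LEVEL_MAP.items():
        if keyword in lowered:
            return level
    return "Other"  # fallback when no keyword matches

print(normalise("Graduate Software Engineer"))
print(normalise("Senior Architect"))
```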
Can a modified version of this work on scraping links listed inside a live chat feed?
That’s not something I’ve tried but yes I think so
Hi, thank you so much for your video. I want to ask how to scrape multiple review pages for one product? I'm confused
I'm trying a website where the data is shown only after typing into an input field; otherwise the HTML element is empty. What should I use? I'm using pyautogui to fill the field, but I don't know how to read the data back
hmm the website I am trying to scrape returns status code 429... and I haven't even started scraping. Do you know what could be causing it?
Hi John, trying to run the code I got this error with render: AttributeError: 'Future' object has no attribute 'html'. Any help please? I didn't find anything on Google. Thanks
Can't install requests-html, any ideas? I'm using Windows and the error comes from lxml; I tried installing lxml separately and got the same error
Thank you man really useful !!
I've followed your code to a tee. It locks up in both PyCharm and VS Code at the render statement (r.html.render(sleep=1)); I literally have to close both programs to get them running again. Any ideas? Great video though.
If it’s the first time running the render method it should download headless chrome - I’m guessing it’s getting stuck there. Maybe try removing requests_html and reinstalling it
After using render I got this error: There is no current event loop in thread 'Thread-5 (process_request_thread)'
Oops... you are a legend... I am blind. This is also in the docs, right at the top 😂 (I think I need some sleep)
Hi John, is it possible to parse the requests-html response with bs4? I've tried passing response.text when making a bs4 soup but it returns None. Can somebody help me?
Hi, yes it is - I’m sure I’ve covered that before. It’s quite a useful method. Try printing the html before making the soup and check is it what you were expecting to see
Hey John, great videos. Thank you so much! I wanted to ask: how can I scrape multiple categories (like /computers, /headphones, /monitors, /keyboards)? Do you have any video or idea for that?
Thanks for your content!
Hi bro, Did you find any technique to scrape multiple categories? Please let me know.
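One simple way to sketch the multi-category loop: build the category URLs up front and iterate over them. The base domain below is a placeholder; the paths come from the question above:

```python
# Placeholder base domain; the category paths come from the question above
BASE = "https://www.example-shop.com"
CATEGORIES = ["/computers", "/headphones", "/monitors", "/keyboards"]

urls = [BASE + path for path in CATEGORIES]
for url in urls:
    # In a real scraper: r = session.get(url); r.html.render(); parse products
    print(url)
```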
With requests_html, when I print the soup I get the message "you are not authorized..." in the page HTML. I tried loading the page manually and it worked, so my IP isn't blocked. Can anyone help me with this?
Hi, I'm trying to retrieve data (the list of employers/vacancies loaded by jQuery) from the Canadian Job Bank. I made a GET request but wasn't able to get the inner response payload from www.jobbank.gc.ca/jobsearch/jobsearch?searchstring=&locationstring=&sort=M. I can see the payload in the Firefox developer tools but couldn't find the right Python library and methods to get it. Is there any way other than Selenium to accomplish this task? I'm at the very beginning of learning programming and would be grateful for any help or advice on what to read or watch to figure it out. Thanks.
I think you might need to use the same approach as my sports stats video - use Postman to replicate the request made by your browser, then copy that over to your Python code
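A sketch of that idea with requests. The header values are placeholders - in practice you would copy the real ones from Postman or the browser's network tab - and the request is only prepared here, not actually sent:

```python
import requests

# Placeholder headers - copy the real ones from Postman or the
# browser's network tab; these values are not from the video.
headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
}
params = {"searchstring": "", "locationstring": "", "sort": "M"}

req = requests.Request(
    "GET",
    "https://www.jobbank.gc.ca/jobsearch/jobsearch",
    headers=headers,
    params=params,
)
prepared = req.prepare()  # build the request without sending it
print(prepared.url)
```

In a real script you would send it with `requests.Session().send(prepared)` (or just call `requests.get(url, headers=headers, params=params)` directly).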
@@JohnWatsonRooney OK, I'll try this out. Thanks anyway)
Thank you sir. This make sense to me
What can I do if the XPath search doesn't find anything?
Hi, first, thanks a lot for your tutorial. I have a question: I generate my CSV file, but my separator is ','. How can I change the separator?
Sure - after the CSV file name, add in sep="..." and put in whatever separator you want to use
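For example, with pandas (writing to an in-memory buffer here just to show the output; normally you'd pass a filename):

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Pale Ale", "Stout"], "price": [2.5, 3.0]})

buf = io.StringIO()  # in-memory stand-in for a real file path
df.to_csv(buf, sep=";", index=False)  # sep sets the separator character
print(buf.getvalue())
```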
Hey John, very helpful video, but I keep having this one issue when I try to render the url, I get this error message: RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.
Were you able to fix it? I'm having the same problem
@@alokyathiraj I'm having a similar issue as well
I can't get this to work either. I think maybe the library needs to be updated.
Nice video! But in your view, which Python web scraping approach uses the fewest resources (memory, etc.)?
If the website is html (no JavaScript) requests and bs4 will be the lightest in my opinion. The method in this video is slower due to the render process but still good for smaller projects - selenium is the slowest and not really designed for scraping but does work when needed
@@JohnWatsonRooney You should absolutely explain this in a single video, from the fastest method to the slowest one. Thanks sir
Great job keep it up keep useful
Thank you♥️♥️ you are BEST💪
While trying to get product links on a category page of the site I work with, I also pick up two extra links I don't want for each product. How can I remove the links I don't want? Or, since one word appears only in the links I want, how can I get just the links containing that word?
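One way to sketch that filter, assuming the wanted links all contain a marker word (the link set and the "/product/" marker below are made up):

```python
# Made-up link set standing in for products.absolute_links;
# "/product/" is an assumed marker word for the wanted links.
links = {
    "https://shop.example.com/product/pale-ale",
    "https://shop.example.com/product/stout",
    "https://shop.example.com/reviews/pale-ale",
    "https://shop.example.com/cart",
}

wanted = sorted(link for link in links if "/product/" in link)
print(wanted)
```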
Hello, I'm having a Chromium-related error when I try to render an HTML page. Can you please tell me how I can fix it?
thanks for learning
Sir, how would you deal with infinite scrolling if you can't find an easy way?
That's a bit more tricky. Without browser automation (Selenium) we can use "r.html.render(sleep=1, scrolldown=x)", where x is the number of times to page down. Not ideal, but it might work
Trying to recreate this on a similar e-commerce website, and print(products) from 4:57 gives NoneType. Any suggestions why?
Just print the HTML source code and check whether what you're looking for is actually there
Using render for the first time, I haven't been able to install anything and it's giving me an error
Amazing sir please keep posting videos like this we will help u to increase subscriber number
Good one! Any idea how to do the same for Laravel-based sites?
Can you login a website using requests-html?
You can, yes - you can POST to the server. I have an older video on my channel where I cover the basics of this if you're interested
you. are. awesome!
Could you do me a solid? I've suffered trying to scrape this site
Dude, update this code. I try to run requests_html and all it states is that it needs Chromium to work, but the thing is I have Chromium on my machine, even the binary file. Why, when I run it, does it attempt to download Chromium (which I already have) and then fail to find it? I tried this a few months back and now I've returned to the same issue. I even uninstalled and reinstalled everything, but same problem.
nice video 👌 and keep going
good explanation
Hope you don't mind me asking, but I have been banging my head against this one for a few hours. I am trying to pick up only a specific URL from a container (the container has non-product URLs):
"
from requests_html import HTMLSession

url = 'https://www.fragrancenet.com/fragrances'

s = HTMLSession()
r = s.get(url)
r.html.render(sleep=1)

products = r.html.xpath('//*[@id="resultSet"]', first=True)
print(products.absolute_links)
"
I am only looking for the p-tags under Result set called:
Any help would be super appreciated, thanks again John.
Hi John, I've watched this video many times; you're great at explaining. However, I'm getting the error "Navigation Timeout Exceeded: 8000 ms exceeded" from r.html.render(sleep=1). I even bumped up the sleep time. Please help.
Try using timeout=(a number larger than 8) instead of sleep. Worked for me
@@pranitganvir449 Thanks for the advice - it worked with timeout=30, and I also added keep_page=True
Hi , can you help me ??
Ohhhh nice I use an API that uses this method
It doesn't work with AliExpress
can you do scrapping video for tracton gyan website?
Great video, I didn't know about this option - I normally used bs4
you shouldn't be john rooney, you should be john legend
what is better bs4 or html.xpath ???
Learn to use both, but generally if I can, I use bs4
hi sir, can you fix this problem :
AttributeError: 'NoneType' object has no attribute 'text'
Thanks, btw nice vid
Awesome
I used the same code and it didn't work for me; I changed the website to my desired one and I get a bunch of errors... :(
Thank Bro
Cider | 4.0% | 44 cl
Trying this: info = r.html.find('div.Select an element with a CSS Selector:', first=True).text
The output shows: AttributeError: 'NoneType' object has no attribute 'text'
You probably chose your class incorrectly, which is why there are no elements in your output. NoneType means you got no result (an empty match).
What does first=True do?
With requests-html, find always returns a list, but using first=True makes it return only a single item - the first element that matches your find criteria
@@JohnWatsonRooney got it, thanks. On to pt2!
🖤👌🏻
Hello, can you do scraping on this page : stats.nba.com/teams/transition/
I want to compare playtype team1 percentile on offense (also the frequency) against team2 percentile on defense. can you help me, please?
Hi! Yes I can scrape that site - I have a video coming this week that scrapes a similar site, and you will be able to apply it to this one too. JR
@@JohnWatsonRooney Great! Thank you for the really quick answer!
So, when I copy the XPath, I get this as a result:
/html/body/div[7]/div[4]/section/div[10]/div[3]/div[2]/div[2]/div[1]/ul[2]
Are you using Chrome or Firefox? That looks like the "full XPath" option, as opposed to just the "XPath". I am planning to do a video on XPaths to clear it up a bit more
@@JohnWatsonRooney The inspector in Firefox - which leads me to think, then, that there's a difference between Chrome and Firefox?
There shouldn’t be but I have seen different results from both
When I try to type r.html.render() I get this: Unresolved attribute reference 'html' for class 'Response'
Beerwulf is not a dynamic site....LOL
The accent, where are you from?
UK near London
Great, but it took a lot of time to render
🥰🥰🥰🥰
Couldn't you have picked any other site instead of a beer website?
Why are you promoting harmful things?
I have a problem with this code: produk = r.html.xpath('/html/body/div[4]/div[2]/div[2]/div[2]/div[1]/div/div[2]', first=True). The result is None or []. How do I fix it?
I'm having the same problem; did you find a solution?
Thanks
r.html.render() is not working. What can I do?
Did you find a void?