I just learned this trick two days ago. One of my friends showed me this method... and I was wondering why no one had uploaded a video on it. And here it is... please do keep making these videos... they are really helpful...
Glad you enjoyed it!
This is probably the best video I've seen on APIs! This topic is so poorly covered on YouTube! Amazing content, thank you for this!!
This is so helpful and educational John! Keep it up mate! Love your work.
I watched this video some years back and it helped me a lot; now, years later, I needed the concepts taught here again. I've been looking all over the internet for this video!! I was afraid it had been deleted 😅
Bro... what! This is next-level scraping. Beyond the complexities of code, yet with all the features. Thank you very much! I love you!
Thanks, very kind!
Wow. This is exactly what I was looking for. Simply brilliant. Thank you!
Thanks!
Well, scraping data from the actual API server as opposed to the webpage itself is actually a great idea. Thanks for the vid.
You are right. Checking for an API should be the first step, to make our lives easier. Thanks John. 💖
OK, I think this video solved the problem I posted yesterday under another episode about hidden APIs. THANKS JOHN!
😁 SUPER HELPFUL, one of the best coding videos I've ever watched!! You've gained a sub for life!
Thank you, I'm glad you enjoyed it!
Great video, it is a lot more useful to work with the API than with Selenium. I improved my time to download everything from 5 minutes to about 1 minute. Thanks
Thanks for the awesome tip, cheers from Seattle!!
Wow. That's amazing 🔥 I really like your work.
Thanks!
Saved a lot of trouble using this method, thanks!
Glad it helped!
Your work is amazing! Thanks for helping me a lot with these scraping practices!
Fantastic content as usual! 🎉
Superb, clearly presented and explained. Thank you so much.
Thank you!
Thanks for this tutorial John. Really appreciate what you are teaching here. It solved my web scraping problem. :)
That’s great I’m glad it helped
So helpful! Much easier for what I was trying to do than BeautifulSoup.
I hope I can buy you a beer sometime man. I appreciate this video for real. Thank you! +1 Follower
Amazing video as always
keep up the good work
Awesome video! Do you have a video on what to do with all the information that you just scraped, examples of how to use it?
Thank you very much!! I was having trouble extracting data from dynamic websites.
Excellent tutorial! Big fan of your videos
Thanks!
Thanks for all the tutorials John. As a newbie to web scraping and data science (never too old to learn at 58), I'm loving the intuitive and plain-English approach you have in your demonstrations. Having watched the 'Scraping News' video and now this one, I wonder how you could refine the script to find the search bar and then submit a suggested topic to it. I.e. I have a favourite news feed site with a search bar where I can refine my chosen reading material, say 'Ukraine' for example, and it goes and fetches all the news from around the world on that topic. It's then that I'd like to scrape the news feeds, and that's where your newsfeed script comes into its own. It would be great if you could demonstrate a video that covers the search aspect before the automated scraping. Thanks, and keep up the videos. Easily my favourite go-to learning resource.
Well, if you look at the network calls when you search for something, you should be able to track down the endpoint they use for searches. You can then call that endpoint yourself and scrape the data that way.
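To make that concrete, here is a minimal sketch of calling such a search endpoint with Python's requests library. The URL, the q parameter, and the results key are all hypothetical stand-ins; find the real ones in the browser's Network tab when you submit a search.

```python
# Minimal sketch of calling a hidden search endpoint found in the
# browser's Network tab. The URL, the "q" parameter, and the "results"
# key are hypothetical - substitute whatever the real site uses.
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # basic browser-like header
params = {"q": "Ukraine"}                # the search term

resp = requests.get("https://example.com/api/search",
                    params=params, headers=headers, timeout=10)
resp.raise_for_status()

for item in resp.json().get("results", []):
    print(item)
```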
So helpful. Thanks for sharing the concept freely
Thanks, glad you enjoyed it
@@JohnWatsonRooney Really... You know, I spent lots of time doing this via Selenium in Python, but this just made my life much easier.
Omg it's so useful!!!!! Got subbed. Thanks!
Thanks for the sub!
This is extremely useful, thanks for the tutorial!
Fantastic, this is the video I was looking for. I was wondering how I could collect past data for matches already played by inserting the date as input. Thx
Thank you! This is really a game changer. :)
That is really useful, thank you for that.
Glad it was helpful!
Can I access the "Statistics" this way too? Like if I wanted to write code that checks if the home team has 4 shots on target and the away team has 0, and other conditions like that.
Yes, I think you could. Do the same process, but on the page where the stats load up, and find the API
@@JohnWatsonRooney Hmm, thanks, I will keep trying. It seems a bit difficult, since some live games show live stats when you click on them, but I couldn't find any keys for them in the JSON file; they were all false, although some should've been true.
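For illustration, a minimal sketch of the kind of condition check discussed in this thread, assuming the statistics endpoint returns JSON with per-team shot counts. The endpoint URL and every key name here are assumptions, not the real site's schema:

```python
# Sketch only - the endpoint URL and JSON keys are assumptions;
# inspect the real payload in the Network tab first.
import requests

resp = requests.get(
    "https://example.com/api/event/12345/statistics",  # hypothetical endpoint
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
stats = resp.json()

home_sot = stats["home"]["shots_on_target"]  # hypothetical key names
away_sot = stats["away"]["shots_on_target"]

if home_sot >= 4 and away_sot == 0:
    print("Condition met: home has 4+ shots on target, away has none")
```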
Thank you so much for this video
Thank you very much, sir. Learning so much.
What about making a Scrapy Splash tutorial? I hope you will make one
I have one on my channel already, but will be doing more as I do more Scrapy videos
@@JohnWatsonRooney It's so great to hear that. I have learned a lot from your videos
This is a perfect, simple video. However, if the API being called changes, how can you parse it, since the old one returns old data???
Thanks in advance.
Thanks for this video!
Instead of making new API calls, can I get the data from the browser's Network tab when the API returns data to the client's browser?
If I understand you correctly, yes you can. If you use Playwright or Selenium you can access the network events and have it return the JSON data each time it loads up a page. I use this method for some sites, depending on what I am doing and how they respond
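A minimal sketch of that approach using Playwright's sync API, capturing JSON responses as the page loads. The page URL and the "/api/" filter string are hypothetical; match them to the site you are inspecting:

```python
# Sketch: capture JSON API responses while the browser loads a page.
# The "/api/" substring filter and the URL are assumptions - adjust
# them to the real endpoint you see in the Network tab.
from playwright.sync_api import sync_playwright

def handle_response(response):
    if "/api/" in response.url and "json" in response.headers.get("content-type", ""):
        print(response.url)
        print(response.json())

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", handle_response)  # fires for every network response
    page.goto("https://example.com/live-scores")  # hypothetical page
    page.wait_for_timeout(5000)  # give the XHR calls time to fire
    browser.close()
```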
Super helpful. Thanks
Hello, it works great. What should I do if I want the odds before the matches start? Let's say that every morning I want to copy the odds. I notice that each match has a numerical event identifier; how do I find this event identifier so that I can copy the odds, and then the next day enter the recorded score next to each event? Thank you and all the best!
Thank you so much!
Great Video
Thank you!
Thanks so much for this tutorial. I was wondering if there is a workaround when a site isn't returning any such XHR data, regardless of what links and buttons you click to try and initiate a response?
Your structuring is amazing.
Since the website calls data from the API every 10 seconds or so, why did I get banned when I automated an interval to request updated data from the API?
Is there a workaround to avoid getting banned?
Like, what other criteria does the website use to recognize a bot?
@Loja Outweb how did you fix yours?
@Loja Outweb He mentioned the website probably works with Cloudflare to avoid DDoS attacks. That's why they will block your IP if you make constant requests. Try rotating IPs like he mentioned, or just lower the request rate by searching every minute.
@Parth Kulkarni he has another video on that: https://www.youtube.com/watch?v=vJwcW2gCCE4
Thanks man
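To make the advice in this thread concrete, here is a minimal sketch of polling an API at a gentle interval while rotating proxies with requests. The proxy addresses and the endpoint are placeholders, not real services:

```python
# Sketch: poll an API at a polite interval, rotating through proxies.
# Proxy URLs and the API endpoint are placeholders.
import itertools
import time
import requests

proxies = itertools.cycle([
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
])

while True:
    proxy = next(proxies)
    try:
        resp = requests.get(
            "https://example.com/api/live-scores",
            headers={"User-Agent": "Mozilla/5.0"},
            proxies={"http": proxy, "https": proxy},
            timeout=10,
        )
        print(resp.json())
    except requests.RequestException as exc:
        print(f"Request via {proxy} failed: {exc}")
    time.sleep(60)  # one request per minute is far kinder than every 10 seconds
```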
Can I apply this method to Flashscore-style websites? I guess that site doesn't have an API URL
Could you do a video on something similar, but where the API wants a key? I copied the request into Insomnia like you did, but I cannot replicate it there. The response says "no API key provided". I am unable to figure out how the client code in the browser embeds the API key without the request on the Network tab knowing about it... The site I am trying to scrape seems to use Vue, if it makes any difference. I tried to inspect the "initiator" JavaScript file, but obviously it is minified and unreadable.
I usually find adding the full headers works; we are then telling the backend we are a browser and we need the information. I'd have to check the site example you mentioned, though. You can email it to me if you like; my email is on my YT page.
@@JohnWatsonRooney Yeah, I thought I had left something out earlier when I tried it a couple of weeks ago. I then saw your video and figured I would give it another shot by copying everything "automatically" (Copy -> Copy as cURL (cmd)), but it did not help (earlier I had made the request myself "from scratch"). I will email you the site and details. Thanks!
@@matheosmattsson2811 This method will only work for public APIs, where private API keys aren't required. Usually you encrypt your key details into a hash and send it over; it's decrypted by the server and your key is extracted there. This means that all an anonymous user would see in the headers from the Network tab is the encrypted hash, and you can't just reuse an existing hash, as it will also include a timestamp.
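Picking up the full-headers suggestion from earlier in this thread, a minimal sketch of replaying a browser request with its complete header set in requests. Every value below is a placeholder; copy the real headers from the Network tab (Copy as cURL) for the site in question:

```python
# Sketch: replay a browser request with its full header set.
# All values below are placeholders - copy the real headers from
# your browser's Network tab for the actual site.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json",
    "Referer": "https://example.com/",
    "X-Requested-With": "XMLHttpRequest",
    # session cookies often matter too:
    "Cookie": "session=PLACEHOLDER",
}

resp = requests.get("https://example.com/api/data", headers=headers, timeout=10)
print(resp.status_code, resp.json() if resp.ok else resp.text)
```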
Can you please try to make a video on how to scrape websites that are using Cloudflare protection?
The video is really well explained, thanks for that. However, I'm trying to add a condition for tennis games; how should I handle the upcoming set "period" from this API in Python?
Hi, cool video. How can I get a correct-score market on a sports betting site, where I can print team names with their corresponding odds, e.g. Team A vs Team B, 2:1 at 9.6?
For a protected API, do you think it is possible to make the first call with Selenium, grab a token, and from that point use it in calls to the API using requests?
Yes, I think so; you can take the cookie from your Selenium session and reuse it in other parts of the code
@@JohnWatsonRooney It seems that it is easier with selenium-wire, since you actually get access to all requests/responses, including the headers
@@JohnWatsonRooney Did you mean mimicking logging in there in the first place using Selenium, so as to make this secret part of the header (call it a token or cookie or whatever the site owner calls it) accessible? I am working out a strategy for scraping API-protected, JSON-stored reviews, sliced by company name, for my master's thesis. However, with no Bearer token in the Authorization header (which you can ONLY see by analyzing a JSON GET request in Postman while logged in), it returns only page 0 of the JSON (regardless of how many pages there might be per company), with only 2 reviews (out of 10 per JSON page when logged in). So if I put all the code from Postman into my web scraping script's headers, i.e. with the Bearer token, and skip the Selenium log-in, I am afraid I would miss some part of the server communication protocol and be blocked or banned (robots.txt doesn't state anything is forbidden, though). What do you advise?
Btw, you make awesome tutorials, dude! I am literally living in them these days!
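Tying this thread together, a minimal sketch of the Selenium-to-requests hand-off: log in through a real browser, lift the session cookies, and reuse them for API calls. The URLs are placeholders, and whether a cookie alone is enough (versus a Bearer token that must be read from captured traffic, e.g. with selenium-wire) depends on the site:

```python
# Sketch: authenticate in a real browser, then reuse the session
# cookies with requests. URLs are placeholders; if the site uses a
# Bearer token instead, you would need to capture it from the
# request headers (e.g. via selenium-wire) rather than the cookies.
import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # log in here, manually or via automation

session = requests.Session()
for cookie in driver.get_cookies():  # list of {"name": ..., "value": ...} dicts
    session.cookies.set(cookie["name"], cookie["value"])
driver.quit()

# Authenticated API calls can now go through requests:
resp = session.get("https://example.com/api/reviews?page=0")
print(resp.json())
```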
How do I prefix team names with their league table position for upcoming soccer fixtures? How do I add a points-per-game (PPG) column? Please assist
Will this approach work with dynamic web pages? Or is requests-html still the best approach for dynamic pages?
Yes, it will; it cuts out the need to get the data from the page. I'd recommend checking this way first to see if it can work for you. If it's not available, then rendering the page is the next option
Sir, which theme do you use in VS Code???
Gruvbox Material - it's in the extensions
@@JohnWatsonRooney thank you sir
Why have I wasted so much time manually reading out HTML results? 🤯
I guess I feared the XHR requests might be too inscrutable or there might be too many hurdles, like cookie management, request tokens/nonces etc.
How often do you run into trouble with those?
It's often down to the individual site, but it's usually just a new cookie that's needed. Sometimes parsing the HTML is the best way though! Explore the site first and then decide your approach
@@JohnWatsonRooney I will!
Thank you for being so helpful in the name of empowering the users again! ❤
John! Amazing video! I am starting out with coding and it was nice to learn a lot with you. Question: how can I set up a filter for live games? For example, just show the live games with a 0-0 score, or where the away team has scored once? Is it possible to filter the live games with parameters? It would be amazing to learn this from you as well. Thank you for your effort!
I think the API would simply return no-score draws as just that - 0 : 0
A few questions. If you perform this API endpoint strategy as suggested here, aren't you creating some kind of "imbalance" in the requests that the server could easily detect as automated computer activity and not a real person? Is that something one needs to consider to avoid being blocked when you scrape the API as suggested here (apart from the obvious, don't do it too fast, etc.)? Also, am I right to believe CAPTCHAs are not an issue here (they can be a hassle sometimes)?
Yes, you can absolutely still be detected and blocked. If scraping lots of data, proxies are a must. With most sites, doing it this way you need the cookie generated from your browser; this cookie data was transferred when we used Insomnia, and that allows us access
How could I get the current match minute?
Is that possible with Node.js, sir?
Yes of course, I don’t know Node or JavaScript that well though I’m afraid!
I can't determine the game minute - is there a solution?
Hi sir, is there any way to scrape a video stream (live video) of football?
Hey!! Thanks for this! It's very informative! :))
I have a doubt regarding scraping, could you help me with it??
Question:
I have a list of 100 (X0, X1, X2..., X99) products along w their pricing (P0, P1, P2....., P99).
Is it possible to scrape the Google Shopping price data for all 100 products? And if an individual product's price, say product X0's price on Google Shopping, is greater than the given price (P0), update that as the new price in a new column?
Your input would be much appreciated!
Thank you!! :)
So I'm trying to create a live events feed as a personal project for Premier League games: goals, cards, assists, etc. Would it be possible to use this method and somehow not get banned?
What if I made 6 different scripts to scrape 6 different score websites? Then I'd only be sending 1 request per minute to each site.
Could this work?
Is it possible to get statistics in real time? Bad English (Brazilian boy) :)
I was trying to scrape internet speeds from Speedtest with this method, but got only 2 entries under the "Name" section of the "Fetch/XHR" tab in the inspector. In "Response" there are only a few letters: for the first entry it's "1d" and for the other it's "1gfi". Is anyone knowledgeable enough to help me find a way around this? Or does the Speedtest webpage not use an API and tables in the first place? (There are speed tests I would want to scrape, and the speeds themselves are plotted on graph curves, so I was thinking the graphs are auto-generated from some table.)
Can you do a new YouTube video about Amazon for 2022? Amazon has changed. I tried it but it does not work anymore; it gives me a 504. I tried it in Java and that does give me all the info.
I learned Python too!
Could you walk everyone through a step-by-step data analysis project from scratch? Thank you in advance!
This is exactly what I was looking for to scrape live data on Bitcoin etc. But sir, is this illegal?
I find your web scraping videos the most useful and user-friendly on YouTube. I'm just wondering if there is a way to scrape an HTML file from the local hard drive for practising purposes, since I spend some time travelling with no internet connection; in addition, I think it would be nice to avoid overloading a server when practising.
Sure, save the HTML to a file and open it in Python; it will load into bs4 for scraping practice on the go
@@JohnWatsonRooney wonderful, thank you so much.
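A minimal sketch of that offline workflow, assuming a page saved locally as saved_page.html; the filename and the CSS selector are placeholders:

```python
# Sketch: practise scraping offline on a locally saved page.
# "saved_page.html" and the CSS selector are placeholders.
from bs4 import BeautifulSoup

with open("saved_page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

for heading in soup.select("h2"):  # adjust the selector to your page
    print(heading.get_text(strip=True))
```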
Will you start a discord server?
I’ve thought about it, I will at some point and I’ll post it up so you guys know. Just not sure when yet!
Man, I have been scraping wrong for so long.
For bet365, any ideas? :(
@John Watson Rooney I got banned by SofaScore "The system identified you as a scraper and banned the IP. To use the data on the website contact the owner and request permission"
Unfortunately that's part of it; ideally you'll need to use proxies to continue. It kinda turns into an arms race
Cloudflare didn't even give me a chance, blocked my IP instantly 😂😂
Ah yeah, that's a real possibility. I usually use my VPN for testing, but even then a lot of those IPs are blocked already, so it's much harder.
I found another problem. When I run scoreslive.py, it raises a JSONDecodeError exception; would you please help me with that? Thanks in advance