This script I threw together saves me hours.

  • Published 15 Aug 2023
  • Finding the best way to scrape data from a site is time-consuming. This script uses selenium-wire to inspect the network requests a site makes and gives you back a list of URLs and JSON responses.
    Proxies: nodemaven.com/?a_aid=JohnWats...
    Patreon: / johnwatsonrooney (NEW free tier)
    Scraper API www.scrapingbee.com/?fpr=jhnwr
    Donations: www.paypal.com/donate/?hosted...
    Hosting: Digital Ocean: m.do.co/c/c7c90f161ff6
    Gear I use: www.amazon.co.uk/shop/johnwat...
  • Science & Technology
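
A minimal sketch of the idea in the description, not the script from the video: it assumes selenium-wire's `driver.requests` list and its `seleniumwire.utils.decode` helper, and keeps only responses whose `Content-Type` mentions JSON. The function names `collect_json_responses` and `capture`, and the example.com URLs in the demo data, are hypothetical.

```python
import json
from types import SimpleNamespace  # only needed for the offline demo data


def collect_json_responses(requests, decode_body=lambda body, encoding: body):
    """Return (url, parsed_json) pairs for captured requests with a JSON body.

    decode_body undoes gzip/brotli compression; the default passes bytes through.
    """
    results = []
    for request in requests:
        response = request.response
        if response is None:  # the request never received a response
            continue
        if "json" not in response.headers.get("Content-Type", "").lower():
            continue
        encoding = response.headers.get("Content-Encoding", "identity")
        try:
            body = decode_body(response.body, encoding)
            results.append((request.url, json.loads(body)))
        except ValueError:  # body claimed to be JSON but did not parse
            continue
    return results


def capture(url):
    """Load url in Chrome through selenium-wire and collect its JSON responses."""
    # Third-party imports stay local so the helper above works without them.
    from seleniumwire import webdriver
    from seleniumwire.utils import decode

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return collect_json_responses(driver.requests, decode_body=decode)
    finally:
        driver.quit()


# Offline demo with stand-in request objects (hypothetical URLs).
fake_requests = [
    SimpleNamespace(
        url="https://example.com/api/items",
        response=SimpleNamespace(
            headers={"Content-Type": "application/json"},
            body=b'{"items": [1, 2]}',
        ),
    ),
    SimpleNamespace(
        url="https://example.com/style.css",
        response=SimpleNamespace(headers={"Content-Type": "text/css"}, body=b"body{}"),
    ),
    SimpleNamespace(url="https://example.com/pending", response=None),
]
json_hits = collect_json_responses(fake_requests)
```

Calling `capture("https://example.com")` would need Chrome and selenium-wire installed; the demo data only exercises the filtering step.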

COMMENTS • 69

  • @liketheduck 3 months ago +2

    Fantastic “apprentice” content. This assumes a basic understanding but also pushes the novice forward. I really appreciate it!

  • @jessejames3169 11 months ago +11

    Love your thought process behind writing this! It makes it easy to follow why you do a certain step, and if it’s necessary for others! Great vids keep it up!

  • @DerekMurawsky 2 months ago

    This is really great, and a great foundation, too. I can see this being extended to support so many things, too.

  • @Extrey 11 months ago +7

    I didn't even know that selenium can be used like this, thank you very much, great work as always))

  • @sandunwijethunga6787 11 months ago +1

    great video. thank you john❤

  • @TimoTalksTech 11 months ago

    Amazing, just what I was looking for. I need to look into whether I could fetch all the IPs too.

  • @jagdish1o1 10 months ago +5

    I used seleniumwire to create a scraping bot. It's a very good package for grabbing backend requests. Using Selenium, I logged in, then grabbed the cookies and the backend API ;) Then I simply closed the browser and used the Python requests library to make the requests, which made things a little faster. Eventually I dockerized everything, pushed the container image to AWS ECR, and ran it in parallel on AWS ECS.
    Pretty amazing.
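
The cookie hand-off described in this comment could look roughly like the sketch below. It assumes the cookies come from Selenium's `driver.get_cookies()` (a list of dicts with `name` and `value` keys) and that the API accepts them via the third-party `requests` library; the function names and the sample cookie values are hypothetical, and this is not the commenter's actual code.

```python
def cookies_for_requests(selenium_cookies):
    """Map Selenium's cookie dicts to the {name: value} form requests accepts."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def fetch_with_browser_cookies(api_url, selenium_cookies, user_agent=None):
    """Call a backend API with cookies copied from a logged-in browser session."""
    import requests  # third-party; imported here so the helper above has no dependencies

    session = requests.Session()
    session.cookies.update(cookies_for_requests(selenium_cookies))
    if user_agent:
        # Some sites tie a session to the browser's User-Agent (an assumption).
        session.headers["User-Agent"] = user_agent
    response = session.get(api_url, timeout=30)
    response.raise_for_status()
    return response.json()


# Shape of the data returned by driver.get_cookies() (sample values are made up).
sample_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "csrftoken", "value": "xyz", "domain": "example.com", "path": "/"},
]
cookie_dict = cookies_for_requests(sample_cookies)
```

Once the cookies are copied, the browser can be closed and every later call goes through the plain HTTP session, which is where the speed-up mentioned above would come from.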

    • @datacleaningchallenge2029 10 months ago

      Impressive. What's your email? I need to ask you a question related to your code.

  • @kite759 11 months ago +1

    that's very useful, thank you

  • @pldvs 11 months ago +6

    "Because. I. Don't. Care..." 😂😂

  • @kocahmet1 11 months ago +1

    golden content here

  • @tizianonakamader8177 11 months ago +1

    Amazing content thank you

  • @ivanowdenis 11 months ago +2

    Hello John, could you make a video on how to scrape data that a server sends through a websocket connection in live mode?

  • @darylhunt9070 11 months ago +1

    Good video. Can you capture API keys in selenium-wire as well? Some APIs use session keys.

    • @JohnWatsonRooney 11 months ago +2

      you can grab any headers and cookies yeah
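
As a rough illustration of that reply, the sketch below pulls session-related headers out of captured requests. It assumes each captured request exposes `url` and a dict-like `headers` with an `items()` method, as selenium-wire's request objects do; the header names in `wanted` (especially `x-api-key`), the function name, and the demo values are assumptions.

```python
from types import SimpleNamespace  # only needed for the offline demo data


def session_headers(requests, wanted=("authorization", "cookie", "x-api-key")):
    """Return {url: {header: value}} for requests carrying any wanted header."""
    found = {}
    for request in requests:
        matches = {
            name: value
            for name, value in request.headers.items()
            if name.lower() in wanted  # header names are case-insensitive
        }
        if matches:
            found[request.url] = matches
    return found


# Stand-in objects; with selenium-wire you would pass driver.requests instead.
fake_requests = [
    SimpleNamespace(
        url="https://example.com/api/me",
        headers={"Authorization": "Bearer token123", "Accept": "application/json"},
    ),
    SimpleNamespace(url="https://example.com/logo.png", headers={"Accept": "image/png"}),
]
auth = session_headers(fake_requests)
```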

  • @zakariaboulouarde4591 1 month ago +1

    Hello, thank you for the amazing video. May I ask how I can bypass a 403 Forbidden from Cloudflare when I am requesting an API? Thank you for all your efforts 🙏🏽

  • @mitvpankaj2454 10 months ago +1

    Great work bro!! I also have one question: whenever I try to scrape Walmart, a "robot or human" pop-up appears. Can you please guide me on how to bypass this type of bot detection system? Thanks, and I love your content; because of you I learned Python!! 👍

    • @JohnWatsonRooney 10 months ago +1

      Check out undetected chrome driver - there’s some good information for it that might help

    • @mitvpankaj2454 10 months ago

      I tried, bro, but it's still showing the same issue. If you have any reference or video, could you please suggest it? It would be very helpful for me and others too :)

  • @satyajeetkumar3993 11 months ago +1

    Hi John!! I really appreciate this new content. I have a query to ask. I was using selenium webdriver in chrome to fetch data from a website. The script is working just fine but after certain iterations, the driver is not working properly or the way it should. I am getting a NoneType error. I tried clearing the cookie and starting a new session and then continue from where I left off but it is still not working. Any suggestions on this?? I really appreciate it!! Thanks!!

    • @JohnWatsonRooney 11 months ago

      hard to say but when i get problems like this i always check to see what the direct output from loading the page is, you could be hitting a captcha

    • @satyajeetkumar3993 11 months ago

      Actually that new page is loading properly. I didn't check for terminal output but the page is loading. After that when I am looking for an element on the same page which I know is available there, I am getting an error.

  • @StonedApe420 11 months ago

    Can it make a complete copy of requests, with URL, headers and payload?

  • @user-nj2om2vt8u 11 months ago +1

    Are you using the JetBrains Mono font? If yes, how does it look so thin?

    • @JohnWatsonRooney 11 months ago

      it is yeah, I don't know I didn't do anything other than select that font sorry

  • @AleksT28 11 months ago +1

    I was working with selenium / selenium-wire until I hit an issue I could not debug: when dockerised, selenium-wire was not listening on the port where Selenium was running.

    • @JohnWatsonRooney 11 months ago

      that's interesting, i haven't tried dockerising it but i will keep an eye open for issues

  • @linuxkerem 11 months ago +1

    Are you using arch linux sir ? And thanks for the content ! 🥰

    • @JohnWatsonRooney 11 months ago

      thanks! it's actually just ubuntu + i3

    • @linuxkerem 11 months ago

      @JohnWatsonRooney Wow, I guess my mind went straight to Arch when I saw a Hyprland-style window manager 😁

  • @maloukemallouke9735 11 months ago

    Thank you,
    I am wondering if you earn money with these tools?

  • @throwyourmindat 10 months ago

    Hi,
    Are you aware of self-healing Selenium scripts? Can you explain the concept of self-healing and how it is even possible? We find an element on a web page using a locator, and if that element isn't found we get an error. How can self-healing find that locator? E.g. if an element located by //input[@name=email] is not found, a self-healing approach can automatically guess that the element was changed in the next build to //input[@name=mailing-addrress]. It would be great if you could help us understand that.

  • @valoclips2896 11 months ago +1

    Nice idea, but I still prefer to log the requests via the Network tab or Burp Suite.
    The chromedriver detection will also kick in for some sites.

    • @JohnWatsonRooney 11 months ago +1

      fair enough, it does have some uses but also limitations as you say.

  • @TheCulpritgamer 3 months ago

    can you please share the script that you created for my future reference ??

  • @iamshiva003 11 months ago +1

    What is the vscode theme and the font used in this video?

    • @JohnWatsonRooney 11 months ago +1

      GitHub Dark theme and JetBrains Mono!

    • @iamshiva003 11 months ago

      @JohnWatsonRooney thank you

  • @AllifIzzuddin 11 months ago +1

    So this is kinda like playwright network events right?

    • @JohnWatsonRooney 11 months ago +1

      Yes same thing but I found it better to use

  • @AhmedThahir2002 11 months ago

    Hi John! Love your work. Could you share the code from your videos?

    • @markbennett5626 11 months ago +1

      Maybe John has the code available to Patreon members ;)

    • @AhmedThahir2002 11 months ago

      @markbennett5626 Ohhhhh okay, no issues hehe :)

  • @satwikawasthi2002 11 months ago +1

    What if the API is only called when a user action occurs?

    • @JohnWatsonRooney 11 months ago

      the next step to upgrade this would be to run the same but insert clicks on various page links first and check each one
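
One way that upgrade might be sketched: clear the captured requests, click an element, wait, and record what the click triggered. This assumes a selenium-wire driver, where `del driver.requests` clears the capture list, and uses the string `"css selector"` (the value behind Selenium's `By.CSS_SELECTOR`). The function name, the selectors, and the `FakeDriver`/`FakeElement` stand-ins are hypothetical and exist only so the example runs without a browser.

```python
import time
from types import SimpleNamespace  # only needed for the offline stand-ins


def requests_after_clicks(driver, css_selectors, settle_seconds=2.0):
    """Click each selector in turn and record the URLs requested after each click."""
    captured = {}
    for selector in css_selectors:
        del driver.requests  # selenium-wire: forget everything captured so far
        driver.find_element("css selector", selector).click()
        time.sleep(settle_seconds)  # crude wait for background calls to finish
        captured[selector] = [request.url for request in driver.requests]
    return captured


class FakeElement:
    """Stand-in element whose click records one request on the fake driver."""

    def __init__(self, driver, url):
        self._driver, self._url = driver, url

    def click(self):
        self._driver._captured.append(SimpleNamespace(url=self._url))


class FakeDriver:
    """Stand-in for a selenium-wire driver; maps selectors to triggered URLs."""

    def __init__(self, routes):
        self._routes, self._captured = routes, []

    @property
    def requests(self):
        return list(self._captured)

    @requests.deleter
    def requests(self):
        self._captured.clear()

    def find_element(self, by, value):
        return FakeElement(self, self._routes[value])


routes = {
    "#tab-reviews": "https://example.com/api/reviews",
    "#tab-specs": "https://example.com/api/specs",
}
clicked = requests_after_clicks(FakeDriver(routes), list(routes), settle_seconds=0)
```

A click that navigates away from the page would invalidate the remaining selectors, so a real version would probably reload the start URL between clicks.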

    • @satwikawasthi2002 11 months ago

      @JohnWatsonRooney thanks for the reply 🙏 Also, most importantly: a POST API that expects custom keys in its headers or payload will not give the expected response. Please make a video on how to handle that.

  • @Niuroteya 11 months ago +1

    I don't really get it. I mean, you can filter the Network tab by link or by the word "api" too if you want to. Plus this solution will not work for everything, but the Network tab will. Other than filtering for only the needed requests, this solution doesn't seem to do anything. And yeah, you can do a bit more advanced filtering here, but does this really save a lot of time for any kind of task?
    It's just hard for me to see how. Did I miss something? I've been making AJAX scripts dealing with forms for the past year+ and for me it would be absolutely useless.

    • @JohnWatsonRooney 11 months ago +4

      I use it when I am given a URL and want to do some quick checks - saving any JSON output so I can search inside all from my terminal. I chose to semi automate something I was doing regularly is all.

    • @markbennett5626 11 months ago +2

      Maybe not for everyone but once scripted including user prompt for url, it'll be quicker than using network tab and much nicer response, plus can see adding the ability for the additional steps of recording session keys and further calls.. Thanks John

  • @AndyTutify 11 months ago +1

    Are you no longer using neovim?

    • @JohnWatsonRooney 11 months ago

      I still use neovim, i decided to use VS Code for video demos as i thought it would include more people

  • @spab87 5 months ago

    Hi, thanks a lot, this was very helpful to learn from. I use contextlib.suppress; it's actually faster than try/except and I think it looks better. Your function would look like this:
    import contextlib
    for request in driver.requests:
        # Skip any request with no response or a body that is not valid JSON.
        with contextlib.suppress(Exception):
            data = decodesw(
                request.response.body,
                request.response.headers.get("Content-Encoding", "identity")
            )
            resp = json.loads(data.decode("UTF-16"))
            resps.append(resp)
    return resps

  • @user-qi2kt8ow5r 10 months ago

    Can I bypass hqq.tv devtool blocking using this?

  • @user-tk5ir1hg7l 11 months ago +1

    Is this better than Puppeteer network events?

    • @JohnWatsonRooney 11 months ago

      I have limited experience with Puppeteer, I expect it to be the same - although I prefer selenium-wire to Playwright for network events

    • @user-tk5ir1hg7l 11 months ago

      @JohnWatsonRooney ok, how about Playwright network events? Does it have similar functionality, or would you still recommend going with seleniumwire?

  • @bakasenpaidesu 10 months ago +1

    .

  • @abdelrahmankhaled8239 2 months ago

    complete noob here just started web scraping
    for some reason the seleniumwire import is giving me this error
    import blinker._saferef
    ModuleNotFoundError: No module named 'blinker._saferef'
    I've been searching online for help for hours. changed python versions (currently using the same one you're using in the video)
    nothing seems to work.
    please help
    thank you in advance

  • @twelfth4927 3 months ago

    Guys, I'm watching with passion, but what would this be helpful for? What do web scrapers actually do?

    • @DudethatGross 2 months ago

      Gathering data that would otherwise be difficult to get without a proper API

  • @Septumsempra8818 10 months ago

    Anyone else update chrome on their pc and had all their scrapers break?😅

  • @MasoomNini 9 months ago

    Hi John, big fan. Thanks for the tutorials ❤
    I need to contact you on any social media; I need help scraping one site, please.