This AI Scraper Update Changes EVERYTHING!!

  • Published Jan 7, 2025

COMMENTS • 95

  • @hannespi2886 · 3 months ago +11

    Can't believe this, you did it. I've been coding non-stop for the last 5 days because of your last video on this, thaaank youu!!

    • @redamarzouk · 3 months ago +2

      My pleasure 🙏

    • @JayS.-mm3qr · 2 months ago +1

      How did a scraper help you code?

  • @randomchannelname9061 · 3 months ago +25

    Nice job 👍🏻
    Perhaps llama locally and / or from groq would be a nice improvement

  • @tanvirahmed1959 · 3 months ago +23

    Please integrate llama3 locally (without any API), as many of us run llama3 locally.

    • @reezlaw · 2 months ago

      I don't know anything about this project, but since it's open source I suppose you can just run Ollama with its OpenAI-compatible API and simply replace the URL in the code, then use whatever model you want.

  • @muhammadadil-v9i · 1 month ago +1

    I can't explain in words what you do. Thanks for your kind efforts!!!

  • @charliecheesman · 1 month ago +2

    Perhaps integrate LiteLLM and then you can have full choice over models with the scraper?

  • @DevJonny · 3 months ago +2

    Nice to see you getting traction. I would love to see some content on how to mitigate and avoid being blocked, especially by Cloudflare.

  • @michaelpongrac2364 · 3 months ago +2

    Great Work!!!
    I appreciate that you have already made it run locally and created a resume scraper.
    Would you possibly combine the two by using the resume scraper with additional inputs as part of creating a JSON profile which could be used as search-criteria input for scraping job searches such as Indeed, Stepstone, or other similar sites?
    It would be great to have the match percentage from the scraping be usable as a filter and/or for sorting.
    The reason that I ask this is because it has multiple uses. If the JSON search-criteria profile had some other definition, it could still be used as generic input values for the search process, thus allowing the match-percentage functionality to have a universal application. The second use is to have a single profile that would deliver better search results than the original profiles on Indeed and Stepstone.
    An additional option could be to use a starting location and radius to help limit the data to be processed. There are map APIs that compute the travel distance between two points as well as the travel time based upon the travel mode (car, bus/train, bike, walk). This would add a lot of value to searches. It could also be added to the match percentage when used.
    I have one additional request. Could you add an option to change the language to German? If you need, I can help with the translation, since I'm an American working in Germany. It would make things a lot easier for people in Germany. I already have a JSON structure. If you would like my help, let me know.

  • @GundamExia88 · 3 months ago +3

    Great video! Having the ability to use locally hosted ollama on the network would be great. I have ollama running llama3 on another machine on the same network.

  • @SCHaworth · 3 months ago +7

    Hmm. I already made a universal headless Chrome scraper. Mine can even interact with the page.
    But you're a better man than me for sharing.

    • @TLCMEDIA1 · 3 months ago

      Mind sharing your code, mate?

    • @aleksandreliott5440 · 1 month ago +1

      You mind sharing your code with us?

    • @SCHaworth · 25 days ago

      @@aleksandreliott5440 Bruh. Turn on --remote-debugging-port=9222,
      call it to navigate to a page, save the post-JS-rendered HTML to a temp file, then use w3m -dump to decode it into a pretty text file.
      It's just combining two already-existing tools, and it will scrape anything you can visit yourself, and quite well.
      And yes, you can also just launch in headless mode for more speed (depends on your use).
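
A sketch of that two-tool recipe in Python, assuming google-chrome and w3m are on your PATH (the binary names are assumptions; --dump-dom prints the DOM after JavaScript has run):

```python
import subprocess

def chrome_dump_cmd(url: str, headless: bool = True) -> list[str]:
    # Build the Chrome invocation; --dump-dom prints the post-JS-rendered DOM.
    cmd = ["google-chrome"]
    if headless:
        cmd.append("--headless")
    return cmd + ["--dump-dom", url]

def scrape_to_text(url: str) -> str:
    # Render the page with Chrome, then pipe the HTML through w3m -dump
    # to flatten it into readable plain text.
    html = subprocess.run(chrome_dump_cmd(url), capture_output=True,
                          text=True, check=True).stdout
    return subprocess.run(["w3m", "-dump", "-T", "text/html"], input=html,
                          capture_output=True, text=True, check=True).stdout
```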

  • @ketchup1993 · 3 months ago +1

    Maybe a way to circumvent the token issue is to count tokens, cut just before the model's token limit, then continue from the cutoff and iterate until you've processed the full page.
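
That cut-and-continue idea can be sketched as a simple chunker. The ~4-characters-per-token ratio below is a rough heuristic, not a real tokenizer:

```python
def chunk_markdown(md: str, max_tokens: int = 8000,
                   chars_per_token: int = 4) -> list[str]:
    # Greedily pack paragraphs into chunks that stay under the model's
    # token budget (approximated as characters / chars_per_token).
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for para in md.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current += ("\n\n" if current else "") + para
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than the budget is kept whole; feed each chunk to the model in turn and merge the results.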

  • @cadiszu9855 · 3 months ago +6

    Auto-subscribed to people that share useful free stuff. Thanks for this!

  • @iltodes7319 · 3 months ago +1

    Good job bro. Please continue

  • @RJ.M. · 3 months ago +1

    You are a wonderful person, thank you for sharing 💪

    • @redamarzouk · 3 months ago

      Thank you, means a lot!🙏

  • @Opeyemi.sanusi · 3 months ago +7

    Love that this is open source. Thank you!🙏🏾 I already knew how you were going to handle the pagination before you started talking 😂 A fix might be to add a starting URL and a field for the second page.
    Another suggestion is proxy support 😢
    I have more interesting additions to this.

  • @mawkuri5496 · 3 months ago +6

    Can I use a Llama model that is running locally on my PC?

  • @mikew2883 · 3 months ago +1

    Awesome! 👏

  • @abdelazizabdelioua890 · 3 months ago

    Good health to you! I have a project in mind and this is what I was looking for to monetize it.
    Thanks ❤

  • @joelfrojmowicz · 3 months ago +1

    Great project, but it would be even greater if you created a Docker container for it and allowed using a local AI (Llama) instead of the cloud.

  • @CTEBACp6uja · 3 months ago +2

    Did you try to add a login option for websites requiring it?
    I tried, but I often get a response from the website that my browser doesn't support JavaScript, or that it is not enabled and is needed to proceed to login. I tried enabling it in Selenium, but I'm still getting the same response.
    Btw, thanks for sharing this, very interesting!

  • @michaelwallace4757 · 3 months ago

    Very nice! 🎉

  • @dewilton7712 · 3 months ago +1

    I keep getting 'Unexpected data format for URL 1' with all sites I try. I have Ollama with Llama3.1 8b installed locally if that matters.

  • @paulham.2447 · 3 months ago +2

    What can I say? Exceptional! Thank you, sir 👍

  • @mohamedamrbadawi · 3 months ago

    Is it possible to add a search box feature where you put in the search URLs for e.g. Amazon, eBay, and Temu to get title and price? A mini price-comparison feature, in short.

  • @shawnsmith9198 · 3 months ago +1

    You are king!

  • @explosiveenterprises1479 · 2 months ago

    How would you utilize this to scrape from behind a login? I don't see any login info embedded in the URL structure, so I'm unsure of the best way to do this.

  • @yazanrisheh5127 · 3 months ago +1

    Reda, thank you for this video. I know your previous version 2 of the scraper let you add delays while scraping a website, but how would V3 handle infinite-scrolling pagination instead of pages 1, 2, 3, etc.?

    • @redamarzouk · 3 months ago

      I have 3 scroll events: the first to half of the page height, the second to almost the end, and then a last one to the end of the page, with random time delays between them.
      Do you think that's enough to handle infinite scroll?
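
Those three staged scrolls with randomized pauses could look like this (a sketch, not the project's actual code; any object with an execute_script method, such as a Selenium driver, works):

```python
import random
import time

def progressive_scroll(driver, fractions=(0.5, 0.9, 1.0),
                       min_delay=0.5, max_delay=2.0):
    # Scroll in stages (half, near-bottom, bottom) with random pauses,
    # giving infinite-scroll pages time to load more content.
    for f in fractions:
        driver.execute_script(
            f"window.scrollTo(0, document.body.scrollHeight * {f});"
        )
        time.sleep(random.uniform(min_delay, max_delay))
```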

  • @ChijiokeObi · 3 months ago

    I believe the way to solve the maximum-token issue is to first strip the HTML results of unnecessary HTML, script, and style tags before sending them to the LLM.

    • @redamarzouk · 3 months ago

      html2text already gets rid of all tags and scripts, but maybe the URLs can be removed as well; that sometimes does decrease the number of tokens in the markdown.
      But the problem is: if the user wants to extract URLs of images or something else, for example, what should happen in that case?
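
One way to reconcile both needs is to make URL removal conditional: drop ordinary markdown link targets to save tokens, but keep image URLs when the user asks for them. A sketch assuming standard markdown link syntax:

```python
import re

def strip_links(md: str, keep_image_urls: bool = False) -> str:
    # Replace markdown links [text](url) with just the text to save tokens;
    # optionally keep image links ![alt](url) so their URLs stay extractable.
    if not keep_image_urls:
        md = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", md)
    md = re.sub(r"(?<!!)\[([^\]]+)\]\([^)]*\)", r"\1", md)
    return md
```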

  • @CyrilSz · 3 months ago

    Incredible, thanks :)

  • @aimenkigs · 3 months ago +3

    Love the project, man! The update fixes the exact problem I faced before 🔥
    I've tried using GPT-4o-mini and Gemini Flash as well, and they both work smoothly. However, when using the local model, the pagination script throws an error on 'openai.ChatCompletion'. Could this be due to a version issue? Thanks

    • @redamarzouk · 3 months ago +2

      My issue with using the local Llama 3.1 8B was really the number of tokens; in my case it was 8k tokens per completion.
      If you have a model with a longer context window and it's still giving you errors, join the Discord and share a screenshot so I can understand the problem better.

    • @ranggasaputra5001 · 3 months ago

      @@redamarzouk Hello, can you send the Discord link again? The one you previously provided has expired. Thanks 🙏

  • @JonathanBarber-hi3vj · 2 months ago

    Thank you so much for this video. I am a no-coder and have no problem following your instructions. I have the latest versions of VS and Python installed, and for some reason I am unable to download the required packages. Can you please advise? Thank you

  • @chrystylord2324 · 3 months ago

    Hello!! Great video. I want to ask if it's possible to scrape a whole article, for example, with your tool. Unlike a lot of people here, I just want to read articles, light novels, and some comics which are behind a paywall. Can your scraper help me with that, or do I need to make some modifications to the code for it to work?

  • @Dmitrird · 3 months ago

    Is it possible to build a table with different URLs and iterate over them automatically?

  • @alexscarbro796 · 3 months ago

    Does anyone know of a tool that can scrape name and address blocks from a largely fixed area on each page of a multi-page PDF?

  • @JayS.-mm3qr · 2 months ago

    Thank you for this very interesting scraper. But I just want a scraper that does not require paid API keys. Can someone PLEASE recommend a basic scraper for that?

  • @moeabdo3114 · 3 months ago

    Can this scrape from YouTube? For SEO? Thanks for your amazing work.

  • @vladlemos · 3 months ago +1

    Very interesting, congratulations on the lesson!

  • @Bryan-lu4du · 3 months ago

    Could we use the app as an API? I want my app to use your app, essentially.

  • @adriangpuiu · 3 months ago +5

    You forgot to specify how to activate the env after they've created it. Maybe some don't know how to do it, and they'll install the requirements into the main Python env :P

    • @gamalfarag · 3 months ago +1

      Thx, that solves the error in my setup, but I have another error:
      ModuleNotFoundError: No module named 'scraper'

    • @redamarzouk · 3 months ago +1

      Yeah, I should probably add that to the documentation.

    • @adriangpuiu · 3 months ago

      @@gamalfarag pip install scraper

  • @ambushtunes · 3 months ago

    How does one select multiple pages? It doesn't seem to work for me. Great job, btw.

  • @SavanVyas91 · 3 months ago

    You're doing local scraping, not Puppeteer?

  • @ZeyadAlmothafar · 3 months ago

    Can I use it to scrape LinkedIn profile data? And is that legal to use commercially (to integrate the data into a web application through APIs)?

  • @omarunzainkun11 · 3 months ago +1

    On your website, one of the files is named "sraper" instead of "scraper", which will eventually cause a "module not found" error. Newbies probably won't realize this even though it's very obvious. Just informing you.

    • @redamarzouk · 3 months ago

      Thanks for letting me know, I fixed it!

  • @hasanparvez8850 · 3 months ago +1

    Chunking the tokens for Alibaba can solve the issue.

  • @OPMultiplayerCoopGames · 3 months ago

    How can I scrape emails from websites? I need to scan many of them, not just one at a time. Could you help me out? :)

  • @velocitai · 3 months ago

    Most of my scraping attempts fail because of a token limitation with GPT :/

  • @rajvaibhav821 · 3 months ago

    Do we really need the Selenium driver and actually opening a browser? Can it be done without that? Headless?

    • @redamarzouk · 3 months ago

      I tried it with headless and headless=new, but it's hit or miss with the infinite-scroll cases. And most pagination details are at the bottom of the page.
      If you want to try it with headless, go to assets.py; the headless option is already there, just place it inside the settings list.
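
In Chrome-options terms, toggling that setting might look like the sketch below (the flag names are standard Chrome switches; the settings-list shape mirrors the idea in the reply, not the project's exact assets.py):

```python
def build_chrome_flags(headless: bool = False) -> list[str]:
    # Base settings list; append the headless switch only when wanted,
    # since headless runs can be hit or miss on infinite-scroll pages.
    flags = ["--disable-gpu", "--window-size=1920,1080"]
    if headless:
        flags.append("--headless=new")
    return flags
```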

  • @maxxflyer · 3 months ago +1

    very good

  • @tiagoreis5390 · 3 months ago

    Do you know how many tokens a SheIn page comes to? Great work.

    • @redamarzouk · 3 months ago

      I didn't try with SheIn before, but they have a fairly simple website. The issue is that every page has 70+ products, meaning it will produce a lot of tokens.

  • @Benjaminborghini · 3 months ago

    I can't get this to work on Spotify streams; I want to track all my streams across all my songs. I also made an HTML link for it to scrape multiple links at one time, so nice that you fixed that now! But it seems like Spotify is blocking it anyway. Any tips on how I could scrape this kind of data? Thanks!

    • @redamarzouk · 3 months ago +3

      If Spotify is one of those websites that force a captcha upon opening, that would block the scraping.
      Someone proposed adding an attended mode for the user to solve the captcha and then allow the app to continue its scraping. I think I will be adding this feature next.

  • @moonwhisperer4804 · 3 months ago

    I'm looking for a way to go from a list page, find all items, go into the detail page of each item, and extract data from there. Can this do that?

    • @redamarzouk · 3 months ago

      Yes, this is the most intuitive way, but even specialized text-to-action apps out there can't do it in a universal way. It's really harder than it sounds.
      That's why getting the pages first and then scraping multiple URLs from those pages at the same time is the most compatible way of doing pagination today.
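
That collect-then-scrape pattern can be sketched like this, where fetch stands in for whatever single-page scraper you already have (a hypothetical callable, not a project function):

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_pages(urls, fetch, max_workers=4):
    # Phase 1 produced the list of page URLs; phase 2 scrapes them
    # concurrently. pool.map preserves the input order in the results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```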

  • @minhvuongluu7644 · 3 months ago

    Can it scrape Google Maps?

  • @AnmolBatti-z5y · 2 months ago

    Will it bypass bot protection like captchas?

    • @redamarzouk · 2 months ago

      It doesn't explicitly bypass a captcha if one arises; the trick is to set the user agent so the website doesn't think we're a bot in the first place.
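
Setting that user agent might look like the sketch below; the UA string is just an example value for a common desktop browser, not something from the project:

```python
# Example desktop user-agent string (an assumption, not the project's value).
DESKTOP_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0.0.0 Safari/537.36")

def with_user_agent(flags: list[str], ua: str = DESKTOP_UA) -> list[str]:
    # Append Chrome's --user-agent switch so the browser reports a
    # normal desktop UA instead of a telltale automation default.
    return flags + [f"--user-agent={ua}"]
```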

  • @towhidurrahman8961 · 3 months ago +2

    Great job, sir!
    I have a question: is it possible to share the webpage opened by Selenium with the user, allowing them to manually interact with it, such as solving captchas or authenticating, to bypass blockades? Once they clear the obstacles, Selenium can resume scraping.

    • @redamarzouk · 3 months ago

      That's actually a great suggestion.

    • @yazanrisheh5127 · 3 months ago

      Yes, please, Reda, this would be an amazing feature. This way we can pretty much solve every captcha without paying for proxies or coding a captcha solver, etc. We can just have it alert us by sending an SMS to our phone saying "Need to solve captcha, come back to your PC", or maybe just play an audio file saying "Solve the captcha".

  • @TheBestgoku · 3 months ago +1

    This is great and all, but how about creating a service, even if it's paid, to help us not get banned for scraping? Then we'd have something.

  • @AIPulse118 · 3 months ago

    Can it scrape the OpenAI docs? I have yet to be able to scrape their pages.

    • @DevJonny · 3 months ago

      Do you mean the scraping part itself, or does the LLM block the content? You might want to try ScrapingBee.

  • @pauljones7798 · 3 months ago +1

    This AI Scraper Update Changes EVERYTHING!!
    Please, can it scrape freelance-services marketplaces?

  • @Anesu-nv1mh · 2 months ago

    Can it also scrape photos and videos and download them?

    • @redamarzouk · 2 months ago

      It can scrape links to pictures and videos, but not the files themselves.
      Of course, the links have to be inside the website's markdown.

  • @jeynergilcaga · 2 months ago

    What about Facebook?

  • @shankar9063 · 3 months ago +1

    Omg update

  • @SoshiForever1_SM · 3 months ago +1

    Incredible

  • @chrisder1814 · 1 month ago

    hello

  • @grahamrennie2057 · 2 months ago

    Looks like your website is down...

    • @redamarzouk · 2 months ago

      I have just tried to access it and it's up; I checked on isituporjustme and it says it's working fine:
      "It's just you. automation-campus.com is up.
      Last updated: Nov 6, 2024, 10:14 PM (1 second ago)"