This Open Source Scraper CHANGES the Game!!!

Поділитися
Вставка
  • Опубліковано 14 січ 2025

КОМЕНТАРІ • 303

  • @redamarzouk
    @redamarzouk  4 місяці тому +45

    Hey Everyone,
    LInk to code: www.automation-campus.com/downloads/scrapemaster
    My GITHUB account has been SUSPENDED (I have no idea why) and I didn't receive any warning or anything from Github justifying the suspension. I'm so confused because similar project of AI Scrapers are on github and none of them got suspended.
    I opened a ticket and I'm waiting for their answer.
    in the meantime I shared the code on my website with all the steps to reproduce the ai scraper.

    • @ShaunPrince
      @ShaunPrince 4 місяці тому +1

      Let me know if I can help with this. I can setup a Gittea on AWS or something.

    • @Kevinsmithns
      @Kevinsmithns 4 місяці тому +2

      Yeah I was just looking and about to comment

    • @alex_osti
      @alex_osti 4 місяці тому +2

      I was about to give it a shot.. Waiting for the update. Great work btw

    • @rperellor
      @rperellor 4 місяці тому +1

      I had the opportunity to view it, but did not clone it

    • @redamarzouk
      @redamarzouk  4 місяці тому +8

      @@rperellor here is the code www.automation-campus.com/downloads/scrapemaster

  • @RoughSubset
    @RoughSubset 4 місяці тому +165

    So I worked at a company once where the data guy built his own web scrapper to scrape data off of our competitors website for pricing etc. One thing that they did to protect their website from scrapping was user-agent filtering, in order for him to overcome this limitation was to have a very long list of different user-agents and rotate them while scrapping the website. I think that will be a good addition to add into your app. A small but useful change.

    • @redamarzouk
      @redamarzouk  4 місяці тому +18

      Yes if we launch the scraper with the same user agent for the same websites so many times they will pick up on it and block us.
      the modification will have a list of OS credentials with their versions and different browsers and their versions.

    • @markomarjanovic8348
      @markomarjanovic8348 4 місяці тому +19

      @@redamarzouk Would it be possible to have a video about proxy rotation implementation? There is not much of it on YT but i think its crucially important.

    • @redamarzouk
      @redamarzouk  4 місяці тому +17

      @@markomarjanovic8348 Added to the backlog

    • @amortalbeing
      @amortalbeing 4 місяці тому +2

      this is a good suggestion, would like this to be added as well.

    • @internetperson2
      @internetperson2 4 місяці тому +1

      Thirded

  • @jdnilsen
    @jdnilsen 3 місяці тому +2

    Thanks!

  • @thisisfabiop
    @thisisfabiop 4 місяці тому +24

    Amazing work! It works great, but it doesn't handle cases where the database is divided into pages instead of using infinite scroll. It would be fantastic if it could also navigate through the pages until there are no more left.
    Another great feature-although it might make the tool more expensive, so it could be offered as an optional, selectable feature in the UI-would be for the scraper to open each item's page and scrape data from there. As you know, the initial page often only displays limited information about the product.

    • @theindubitable
      @theindubitable 4 дні тому

      Totally agree. I use axiom today, really good tool, that does just that.

  • @SergeyNumerov
    @SergeyNumerov 4 місяці тому +34

    Pretty cool.
    Let me point out, though, that the main complexity with scraping is that often times the relevant content is hidden: that is, getting to it may require clicking various UX elements.
    So to _really_ crack Scraping with AI, we'll need to go agentic: the solution will need to figure out what to click in order to reveal information of interest.

    • @SpragginsDesigns
      @SpragginsDesigns 4 місяці тому +4

      Exactly. Anyone interested in helping me make something like this? Or is there something available already?

    • @pyros4333
      @pyros4333 3 місяці тому +1

      ​@@SpragginsDesignsyou could just hire someone to build it for you easily

    • @jryde421
      @jryde421 15 днів тому

      ​@SpragginsDesigns i know a community that helps build tools like that for free

  • @justjosh1400
    @justjosh1400 4 місяці тому +7

    Definitely going to use this, I think this is awesome. As a suggestion for future options it would be great to have pagination support and levels deep. Has a lot of my scraping his location-based, for instance States-cities-locations. And the data I usually want is within the locations which may only be a few.

    • @redamarzouk
      @redamarzouk  4 місяці тому +3

      Thank you.
      Yes Pagination will make this complete.
      But I’m thinking how can I make it universal, cause it has to work on every website, so would I just add another llm call to detect any url pagination pattern or do you have a better idea on how to do it ?

    • @justjosh1400
      @justjosh1400 4 місяці тому +1

      @@redamarzouk that might actually work using a lower model would be capable of determining if the page has pagination. Or have a checkbox for user to manually say it has pagination so the LLM will be looking for it. That way it's not always looking for it. And when it finds it return what kind of class it is. IDK

    • @wdonno
      @wdonno 4 місяці тому

      @@redamarzouksimilar scenarios may be an interim pathway: if the initial url prompts for a selection of (text input) that determines next page, can you add the ability to make that selection, ideally from a list of items of prior interest? The recursive ability to select specific buttons to push according to options on following pages would then solve a large number of use cases (ie an ability to map different actions according to a preselected known option types)? The base use case is to download files from a selection post which varies by initial (or ideally subsequent) text inputs, terminated by pressing a button to download a file or selected files). The approach can then be expanded to add more scenarios, until it is universal!

    • @justjosh1400
      @justjosh1400 4 місяці тому +1

      Thinking about and just thought maybe have an area to manually put in div container that the user can grab from the inspect tool.
      Or..
      Since we're using a LLM you could always prompt for it and return the value of the container. Such as look to see if this page has pagination at the bottom or top if so return a value perhaps and use that value to fill in

  • @moiguess3256
    @moiguess3256 3 місяці тому +1

    You earned a new subscriber. Algerian brother here.

  • @danielcave9606
    @danielcave9606 3 місяці тому +2

    Most of the "traditional" Enteprise grade scraping tech companies are adopting LLMs into their stack as an option for when it makes sense. When you're scraping millions/billions of pages every 100th of a cent matters, so taking a composite AI approach, using ML models to get the majority of the standard data points for a general schema cheaply, and then allowing LLMs to the thing they do best at extracting data from unstructured text to extend that schema, that way you get eh cost efficiency with the flexibility of LLMs when needed.
    The real benefit of the LLM approach for bigger teams/projects is actually that is abstracts away from hard coding selectors into your spiders, so they are far more robust and unlikely to break in 3 months when the website changes its HTML, reducing your maintenance burden/debt. Thats my 10 cents anyway.
    I personally love what your project does for the everyday person though, getting small/medium crawls done where price per request isn't so important, and where you will have time/space for more rigorous custom QA. I especially love it for content generation purposes, data journalism, chart porn and the like. Great work!

    • @redamarzouk
      @redamarzouk  3 місяці тому

      Yeah I thought I was creating a scraper at scale, but once started using it extensively I see it more as a productivity tool to help get the data quickly without the need for copy paste.
      Traditional scrapers will still have a place in the market simply because once you want to scrape hundreds of thousands or millions of pages, the cost of paying coders for custom scripts and maintenance will make sense compared to the value of the data scraped.

  • @JordanCrawfordSF
    @JordanCrawfordSF 2 місяці тому

    0:36 - dude got possessed by ChatGPT and his eyes went bananas.

  • @MoneylessWorld
    @MoneylessWorld 4 місяці тому +5

    The dependency on OpenAI and the API key is a bummer.
    It would be better if we insert our own open-source AI engine and models.

    • @sixman9
      @sixman9 3 місяці тому

      If I'm not wrong, tools like Ollama use some of OpenAI's API surface to expose local LLMs. The docs read 'for chat/completions'.
      if this scraper is using OpenAI's function calling interface, you might be out of luck.

    • @91Chanito
      @91Chanito 3 місяці тому +1

      You can do that with your local llm.

  • @ginocote
    @ginocote 4 місяці тому +4

    One of my idea is to create or use a AI scraper to get the first scrape test. If it work you do output somethine like a json that will get the id or class of the scraper element, tant you give this json to your conventional no AI scraper to scrape the website for free and faster without the need of AI afterware.

    • @lovol2
      @lovol2 4 місяці тому

      This is just writing code. Just copy paste the html into chatgpt and say write the code to parse into JSON.. works really well.

  • @rgsiiiya
    @rgsiiiya 3 місяці тому

    This, and the V2 with Llama, are very interesting concepts, and I believe could be tremendously valuable.
    The shortcome is that it is very limited to just the single page at the URL location.
    To be truly valuable, it needs to also be a scraper (as you mention).
    Think of the use case to scrape ecommerce sites for product details. any "real' ecommerce site is going to have many many categories and pages of categorized product listings.
    While you can set up traditional scrapers and manually configure the navigation, this should be where AI should really shine. It should be able to figure out the navigation and automatically navigate/scrape the site.

  • @mzahran001
    @mzahran001 4 місяці тому +3

    Thanks for the great video. Idea for nest videos: Could you extend the code with crawling, for example, getting results from search engines or following a specific path to get more structured data?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      You're welcome, can you elaborate more on how it should look like ?
      Because this will be awesome and I actually gave it some thought, but it's hard to get the exact link of multiple pages from which you want to extract data if you don't have the link for the first page.
      you think we can trust a search engine to give us the exact links we want to scrape data from?

  • @dimadem
    @dimadem 4 місяці тому +1

    so good idea and explanation, thank you

  • @minissoft
    @minissoft 4 місяці тому +7

    Hello Reda, you should use Polars instead of Pandas, in a lot of cases is much faster than Pandas
    Also add_argument("--disable-search-engine-choice-screen") is useful + ("--headless") maybe?

    • @redamarzouk
      @redamarzouk  4 місяці тому +1

      Oh I was looking for that argument "-disable-search-engine-choice-screen" that pop up is annoying ( even if it doesn't affect the scraping). I will be adding that, thank you!!

  • @shawnsmith9198
    @shawnsmith9198 4 місяці тому +4

    you are genius! I am on a mac, so I just had to change the driver call, but everything else is working well. pagination or series of urls would be cool. i love how you have it load in the chrome browser. this really changes how i think about cross platform apps. i wonder if we can scrape instagram now. or what about downloading images? maybe a simple copy table button, since I just copy and paste into google docs.

    • @jimbob3823
      @jimbob3823 4 місяці тому

      New to macos can you please share your driver path? Not 100% which is the executable. Ty!

    • @thecashlessgamer480
      @thecashlessgamer480 4 місяці тому

      Yes please can you help me set it up on my mac as well?

    • @wavelyveney9021
      @wavelyveney9021 3 місяці тому

      I need assistance in setting up on a mac

  • @SamirDamle
    @SamirDamle 4 місяці тому +9

    Thanks for the simple tutorial and code.
    Can you add an example of using this scraper with local Ollama and Llama 3.1 instead of OpenAI to make it totally free?

    • @redamarzouk
      @redamarzouk  4 місяці тому +5

      You’re welcome.
      I can add it but I won’t be able to test it.
      My small gpu can’t really handle it especially when I’m filming.

    • @HyperUpscale
      @HyperUpscale 4 місяці тому

      @@redamarzouk YES, PLEASE 🙏!!!

    • @GundamExia88
      @GundamExia88 4 місяці тому

      @@redamarzouk I hope this get added. I prefer to run Ollama locally. I'm only using a GTX 1070, it works fine.

    • @idrinkmusic
      @idrinkmusic 4 місяці тому +1

      @@redamarzouk this would be a game-changing update. You earned a sub for this video regardless.

    • @carvierdotdev
      @carvierdotdev 4 місяці тому

      ​@@GundamExia88 could you please tell me what models you run? I have the GTX 1080 Ti 11GB, thanks to a friend, and I want to play with that but I don't even know it's possible 😂😅

  • @TheLionsaba
    @TheLionsaba 4 місяці тому +1

    Great video as always , only downside is that it is adressing people who work with code and experienced in data scraping , but for no code or very little code like me , i think the best way is to use computer vision models , Vllm , chatgpt already have it in their api , but also we have 2 new open source models that just got ou this week , Qwen 2 VL , and microsoft phi 3.5 vision.

    • @quercus3290
      @quercus3290 4 місяці тому

      LAION have a model in open source, it is a very powerful scraper, you will most likely need to fine tune any vision models.

  • @orangehatmusic225
    @orangehatmusic225 4 місяці тому +3

    So you can scrape 666.66 pages for $1 based on that usage.

  • @remusomega
    @remusomega 4 місяці тому

    a really cool feature would to add a text-splitter where it splits the text semantically into small chunks so we can readily use this to feed a RAG. Right now we typically splice things arbitrarily, but semantic splitting is the best.

    • @redamarzouk
      @redamarzouk  4 місяці тому +1

      can you give me an example of an output to split?

    • @TimothyJoh
      @TimothyJoh 4 місяці тому

      There are many such splitters available in llamaindex or langchain already. Another “automated” way might be to ask GPT 4o mini to split for you

  • @aleksandars9254
    @aleksandars9254 3 місяці тому

    Thanks dor the video! What mic are you using?

  • @ScottLahteine
    @ScottLahteine 4 місяці тому +3

    The use case I have for a script like this one is to scrape my own open source project code history to convert several versions of config files that contain lots of good documentation into YAML that can be deployed to a Jekyll website. So all the same principles apply, especially the need to output consistent structured data. I look forward to learning more about the development of this new way of scraping and applying it to my own situation. Cheers!

    • @lawrencemanning
      @lawrencemanning 4 місяці тому

      The problem is you now will have an indeterminate algorithm taking you from input to output. In other words the mechanism will be fundamentally untestable and unrepeatable. It’s basically the same as feeding data to a bunch of chimpanzees and expecting them to perform the same processing on it. In other words this is fine if you have a human to check the output each time (the interactive use case) but any kind of automatic, unattended runs? Forget it.

  • @HyperUpscale
    @HyperUpscale 4 місяці тому +4

    Can you make it to use ollama on the back instead of OpenAI?

  • @snehasissnehasis-co1sn
    @snehasissnehasis-co1sn 4 місяці тому +13

    I want to use groq api key bcoz it's free to use or local llm like ollama..... Please modify this code if possible......Great video.....

    • @satyaviswapavanranga5915
      @satyaviswapavanranga5915 4 місяці тому +1

      same question, I was wondering can we do it using groq or cohere?

    • @ianmatejka3533
      @ianmatejka3533 4 місяці тому +1

      Wrap the groq api key by os.getenv() instead of passing in the string

    • @redamarzouk
      @redamarzouk  4 місяці тому +5

      @@snehasissnehasis-co1sn both has been added.
      Will present them in the next video.

  • @CryptoDuhd
    @CryptoDuhd 4 місяці тому

    I would love it even more if you created a docker container that was just downloadable and thereby installable directly on a Linux site. A user agent swap feature (like a list of user agents that could be chosen like round robin algorithm, or randomized) would be great too and handling a list of proxies that would also be swapped.

    • @redamarzouk
      @redamarzouk  4 місяці тому

      I haven't created a docket container, but I made a random user agent pick from a list. you can find the code to that in this video ua-cam.com/video/xrt2GViRzQo/v-deo.htmlsi=smByssvvNhudzgRS
      What type of websites you will use this app to scrape from?

  • @LeftBoot
    @LeftBoot 4 місяці тому +1

    How deep / how many 'pages in' will it go?

  • @daedaluxe
    @daedaluxe 4 місяці тому +1

    I don't think llms are ready for this scraping yet, better to get an llm to make a flask python app and make it manually scrape based on class names so you pull correct data with no hallucination, can also pull images and zip the images with zipfile

    • @redamarzouk
      @redamarzouk  4 місяці тому

      LLMs are not made the same, while I was scraping websites with 60K+ tokens I noticed that gpt4o mini gets me only a subset of the data while gpt4o latest manages to get me all the data.
      If someone is willing to pay 0.5 to 1$ per extraction, they can use gpt4o with a guaranteed correct and complete output.
      But 1$ an extraction is still very high if we want to scale it, in that sense it’s not ready.
      But for most cases mini works great with 0.005$ per extraction and it’s absolutely ready for anything.

  • @marcusmayer1055
    @marcusmayer1055 4 місяці тому +2

    How to Add local llm llama for this projekt?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      I did, watch this video ua-cam.com/video/xrt2GViRzQo/v-deo.htmlsi=XWUzIu8uBehK4AV5

  • @echobucket
    @echobucket 4 місяці тому

    I would not trust this to not hallucinate. I think of a famous example where it misinterpreted the column and concatenated some numbers together instead of treating them as separate columns, leading to incorrect values.

    • @redamarzouk
      @redamarzouk  4 місяці тому

      most data in tables results in line breaks between values in markdowns.
      can you share the use case where it has hallucinated for you, it will be very interesting use case?

  • @sahil5124
    @sahil5124 4 місяці тому +1

    So its traditional scraping (selenium and beautiful soup) and AI is only used to organize the scraped data in a given format. The AI does not do the scraping. Is it correct or am I missing something?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      Yes the AI does the parsing. but creating unstructured markdowns can't really be called traditional scraping, no one will scrape the whole unstructured data from the html in a traditional setup.

  • @Ant-ym3mw
    @Ant-ym3mw 3 місяці тому +1

    You got yourself a new sub!

  • @blunoodle
    @blunoodle 3 місяці тому

    I used replit Ai agent to build + deploy a Kickass website scraper in like 10 mins!

  • @nmlker
    @nmlker 4 місяці тому +1

    @redamarzouk Nice and easy scraper. I saw that you also have Scrapemaster 2.0 and installed that. The Env file mentions a Google API key. Which one should be added? Have a link where to get this particular Google API key?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      Thank you, to use the google API Key go to aistudio.google.com/app/apikey
      and from there create a new api key and add it to the .env.
      You can find all the details of the scarpeMaster 2.0 from here
      ua-cam.com/video/xrt2GViRzQo/v-deo.htmlsi=KH5bfxyYJ9NV90FU

  • @brbl415
    @brbl415 3 місяці тому +1

    does it bypass re-captcha?

  • @CicadaMania
    @CicadaMania 2 місяці тому

    Does a Disallow statement in the robots.txt like Disallow: User-agent: GPTBot stop it from working?

  • @staticalmo
    @staticalmo 4 місяці тому +6

    No pagination?

    • @redamarzouk
      @redamarzouk  4 місяці тому +1

      Check the new video, the scraper works with Llama3.1 and Qroq model Llama 70B for free: ua-cam.com/video/xrt2GViRzQo/v-deo.html

  • @djasnive
    @djasnive 4 місяці тому +3

    Great Project.
    Is it possible to use OpenSource and Self Hosted model like Llama ?

    • @redamarzouk
      @redamarzouk  4 місяці тому +2

      Thank you.
      Yes it's possible, but I didn't even try this time because gpt4o and Gemini flash are so cheap and have a huge context window and I just went with them.
      But it's perfectly possible, you just need to modify the "format_data" function.

    • @satyaviswapavanranga5915
      @satyaviswapavanranga5915 4 місяці тому

      @@redamarzouk Thank you so much, I had the same question, Thanks for answering.

  • @maxxflyer
    @maxxflyer 3 місяці тому

    if I show the screenshot of the pokemons to gpt it will directly scrape all the data. so basically my first feeling is the AI is enough smart to suggest the fields in a dropdown menu. so I can choose them and tell what I really want. And decide a final label for each one of them.
    ...just an example to start!
    but as I said chatgpt can do the same just with a prompt. I don't actually need your app unless the page is full of data. in that case there may be limitations.
    so you should ask your self what a prompt can't do
    anyway my real problem is to have a scraper able to scrape data that are distributed around various pages. or for those cases where you must "load more" elements clicking a button.
    and I want to be able to specify the download format. gpt can reformat anything to anything.
    nice work but there are tons of improvements to be made. I will follow you to see where you get to.

  • @GabrielM01
    @GabrielM01 2 місяці тому

    Would be nice to have a option to use ollama so we can run it locally without using openais proprietary ai

  • @iltodes7319
    @iltodes7319 4 місяці тому +1

    Good job bro continue ❤

  • @ErickXavier
    @ErickXavier 2 місяці тому

    What about adding Pagination Support? Where the A.I. will go through pagrs and pages to scrape long paginated data?

  • @aveenof
    @aveenof 4 місяці тому +2

    Awesome work! Any idea why scraped output list gets truncated even if input+output tokens < max?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      in some cases I noticed that gpt4o mini can't extract all the data from the website.
      I tried with gpt4o and it was successful.
      So if you're sure your data is in the markdowns and gpt4o mini didn't pick it up, try with gpt4o.

  • @amortalbeing
    @amortalbeing 4 місяці тому +1

    This was great thanks.

  • @BohemianAnarchy
    @BohemianAnarchy 4 місяці тому

    Curious Why not puppeteer?

  • @stokedbeachbum
    @stokedbeachbum 4 місяці тому +1

    Can you also crawl a site such as Zillow and scrape multiple URLs?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      websites like zillow tend to have sooo much data inside of them 100K+ tokens, but the answer is still yes.

  • @mrsai4740
    @mrsai4740 4 місяці тому +1

    Hmm It seems like i ran into a limitation. I tried scrapping some golf course (lattitudes and longitudes) from google maps, but It only seems to ever give me 30 rows of data. At first i thought this might be an issue with max tokens, but i increased the max to the highest value possible: "16384" tokens, but this still only gave me around 30 rows with the same data

    • @redamarzouk
      @redamarzouk  4 місяці тому

      What model have you been using because gpt4omini can go up to 128000 tokens, and in my last video I've added gemini which can go up to more than 1M+.
      I've noticed this behavior as well, when a single page has sooooo much data, not just the table with the necessary data but other data, we run into a hard limit on how many rows we can scrape (Especially with apps like @irbnb and zill0w where there is a map that have so much data we won't be scraping), I guess you found the same limitation.

    • @mrsai4740
      @mrsai4740 4 місяці тому

      so i have been experimenting with this code and I got it to work with pagination by specifying a new field for a next button and a new field for number of pages. This seems to work well, but it also got me thinking: If we have too many tokens, we can probably try to chop the data up and then run the peices through the llm. The only thing i can see, is that if we start batching the data, we could end up missing critical peices of imformation (if we substring ot the worng spot, we may end up missing rows). I will try out gemini, i have never used it

    • @redamarzouk
      @redamarzouk  4 місяці тому

      @@mrsai4740 on some websites we can get either the next page or the new the url of the pages just by specifying it in the fields using this current version of the scrapper.
      But the problem is that most websites don't include all the url of the pages in the first page, usually it's under the form
      (1 2 3 4 ....45 46 47 48) For example.
      In this case we have to ask the LLM to conclude the url of the other pages using the pattern from the urls that it found.
      Other websites where we only have the next button can only be scraped one url at a time, so the universal approach will need some time and work to be figured out.

    • @mrsai4740
      @mrsai4740 4 місяці тому

      @@redamarzouk hmmm maybe we are tackling this in the wrong way, cause it seems like for this to be a universal solution, some legwork by the user needs to be done. In cases like that scrapeme site, yeah it is allot easier to provide an array of urls or a template that describes all the urls, but this doesn't tackle the problems of single page applications. Some sites have a paginator that modifies the current page with updated information. I guess it's back to the question: "how can we programmatically detect the way a site is paginating data?"

  • @jewlouds
    @jewlouds 4 місяці тому +1

    it actually works pretty good.

  • @DummyAllan
    @DummyAllan 4 місяці тому +1

    I really appreciate the great work your are doing.
    Quick one, what happens to sites that require credentials? How do you handle that case?
    Thanks

    • @redamarzouk
      @redamarzouk  4 місяці тому +1

      That will need an intervention for your side, keep the website open and run the process again so it has access directly to the data.

  • @aleksd286
    @aleksd286 4 місяці тому

    Problem isn’t to scrape the data, it’s if you have a public facing website most likely you’ll get sued. Nowadays data is a copyrighted material

  • @SohanDomingo
    @SohanDomingo 4 місяці тому

    What video recording software you use?

  • @Alphamaan
    @Alphamaan 2 місяці тому

    Can this app click on a car's page to scrap the details and go back to click on another car's page to scrap the details again?

  • @KPK_7
    @KPK_7 3 місяці тому

    Any way to scrape Twitter specific keyword

  • @brianzvc
    @brianzvc 2 місяці тому

    does this scrape dynamic data?

  • @Web.Scraping
    @Web.Scraping 4 місяці тому

    What about captcha solving, such as cloudflare, recaptcha, hcaptcha..

  • @asanadaniel497
    @asanadaniel497 12 днів тому

    Can we access the Website Directly ?

  • @ditleporc
    @ditleporc 4 місяці тому

    Good job Reda, what'sup with your we automation-campus website ? is it down ? too much success ?

    • @redamarzouk
      @redamarzouk  4 місяці тому +1

      Thank you. but the website is up for me I've just checked on multiple devices and on isitdownorjustme, all working.

    • @ditleporc
      @ditleporc 4 місяці тому

      @@redamarzouk Zscaller classified your site as suspicious....

  • @chandler_short
    @chandler_short 3 місяці тому

    How about something like scraping facebook marketplace or offerup?

  • @TLCMEDIA1
    @TLCMEDIA1 4 місяці тому +1

    This is amazing, I have been trying to reproduce the code but I keep getting errors. Any chance you can do a dummy video . Step by step as chat gpt does ? Please 🙏🏾

    • @redamarzouk
      @redamarzouk  4 місяці тому +1

      I did watch this video ua-cam.com/video/xrt2GViRzQo/v-deo.htmlsi=XWUzIu8uBehK4AV5

    • @TLCMEDIA1
      @TLCMEDIA1 4 місяці тому

      @@redamarzouk appreciate you so much 🙌🏾💯

  • @VaibhavShewale
    @VaibhavShewale 4 місяці тому +1

    lol, in college time i made a web scraper as my project and got full marks XD

  • @BaldyMacbeard
    @BaldyMacbeard 4 місяці тому +5

    Ah yes. Finally... an even more expensive way to scrape sites than we used to have...

    • @redamarzouk
      @redamarzouk  4 місяці тому

      can you elaborate on what part you think is expensive?
      is it the scraping I made or just generally speaking ?

    • @the_real_cookiez
      @the_real_cookiez 4 місяці тому

      Beautifulsoup is free. And anything with Llm apis are not scalable cuz it's per usage. ​@@redamarzouk

    • @realmstupid-on8df
      @realmstupid-on8df 3 місяці тому

      $0.0015 is nothing. I bought $1 in Bitcoin at this amount.

    • @ravendude3632
      @ravendude3632 16 днів тому

      When you scrape one page, it's cheap. People scrape thousands of pages for data. That would rendered it useless.

  • @Daltoncast
    @Daltoncast 4 місяці тому

    Takes a screenshot then extracts with AI?

  • @moeabdo3114
    @moeabdo3114 3 місяці тому

    Can this scrape from youtube ? For seo ? Thx for your amazing work

  • @SavanVyas91
    @SavanVyas91 4 місяці тому

    Pagination will be critical for this

  • @JuankM1050
    @JuankM1050 4 місяці тому

    then i tried to make it work with the google gemini api, and sadly i could not. it always returns the empty table.

    • @redamarzouk
      @redamarzouk  4 місяці тому

      I've just added gemini to an updated script I'm working on, I also added Llama 3.1.
      stay tuned for the next video.

  • @djagryn
    @djagryn 4 місяці тому +1

    Super intéressant 🎉

  • @mikevinitsky8506
    @mikevinitsky8506 4 місяці тому

    can you make it for it to spider a website and if it finds a page that has all the required tags it puts the information in json, database, etc?

  • @eea8888
    @eea8888 4 місяці тому

    What if the data should be dynamic or there will be some click like search button, or their is select to choose from, and after that, scrap the data? What should we do in that case ?

  • @Anton112eclipse
    @Anton112eclipse 3 місяці тому

    how does it work with pagination?

  • @edma6613
    @edma6613 4 місяці тому

    Could it download or summarize the files (pdf…) from a website?

  • @joshd265
    @joshd265 4 місяці тому

    Please can you host this tool online so that us non dev folk can easily access it. Also, would be great to have the ability for the model to be able to summarise and pull keywords out of long product descriptions etc.

  • @eightrice
    @eightrice 4 місяці тому

    there is no need to parse the actual scraped data through the LLM

    • @redamarzouk
      @redamarzouk  4 місяці тому

      I didn't scrape the structured data, but rather unstructured markdowns. So parsing is necessary in my case to get the table I want.

  • @LeftBoot
    @LeftBoot 4 місяці тому

    Can it be multimodal? Viewing data in an image, also creating data tables into an image. Eg. Create a wallpaper of the most important LINUX keyboard shortcuts. etc

  • @ghostwhowalks2324
    @ghostwhowalks2324 4 місяці тому

    can you use playwright as well ?

  • @younube2
    @younube2 4 місяці тому

    Does this work on Amazon?

  • @lyusvirazi6006
    @lyusvirazi6006 4 місяці тому

    Can you scrape PDF file from a website with this?

  • @atultanna
    @atultanna 4 місяці тому

    This a great job Hope you could share a code for auto blogging Looking around but not able to find much Where to get in touch

  • @neylz
    @neylz 4 місяці тому

    can this be used to scrape amazon data?

  • @grahamahosking
    @grahamahosking 4 місяці тому

    Is it possible to add this to Home Assistant?

  • @SoSoInfinite
    @SoSoInfinite 3 місяці тому

    Can this scrape eBay api?

  • @younube2
    @younube2 4 місяці тому

    Can you input multiple URLs and have the scraper collate + populate the same file?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      It can't do that today, but it will be a great addition.

  • @viejitoloco4133
    @viejitoloco4133 4 місяці тому

    why do all that random stuff? what's the purpose?

  • @anianait
    @anianait 4 місяці тому

    Or in Chrome, use the menu "Save web page as .... "

  • @w3whq
    @w3whq 4 місяці тому +1

    Great resource.

  • @danielerikschaconbaquerizo2957
    @danielerikschaconbaquerizo2957 4 місяці тому

    whay about using library curl_cffi with requets to simulate a browser instead of selenium or playwright instead of selenium ? i think it would be faster.

  • @Cygx
    @Cygx 4 місяці тому

    why do I need to use a llm for scraping the data?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      Yeah for 1 of 2 websites it's doesn't make sense, but to scrape any website with 1 single app is pretty useful.
      Will you still prefer the traditional option even if you have to create a script every time ?

  • @kakamoora7874
    @kakamoora7874 4 місяці тому +1

    It’s working…. But problem was some missing data… it’s given the own data…

    • @redamarzouk
      @redamarzouk  4 місяці тому

      That actually gives me an idea of adding a text box where you can optionally add some instructions about the specific website you're scraping.

  • @menachem-145
    @menachem-145 4 місяці тому

    how can i work with this on mac?

  • @AbderrahmaneMotrani
    @AbderrahmaneMotrani 4 місяці тому

    Nice work Reda, I was actually for something like this. I tried to access the repo but the link says 404 not found.

    • @redamarzouk
      @redamarzouk  4 місяці тому

      yeah github banned me for some reason, here is the link to the entire code:
      www.automation-campus.com/downloads/scrapemaster

  • @SiliconSouthShow
    @SiliconSouthShow 4 місяці тому +1

    (sigh) now, make it work with ollama with free llm's, so...I don't support cost f anything not low or cheap, free is king, when it comes to cost, these are things you can do paying services for cheapo and low cost.. And don't have to write anything. But.....I appreciate the value in explain, sorta what does what within the script (the dependencies). This is useful to many folks out there, I know when I was in a certain times it was valuable to me.

  • @bfamily787
    @bfamily787 4 місяці тому +3

    Great video, can you show how to implement local LLM like Ollama instead of openAI?

    • @redamarzouk
      @redamarzouk  4 місяці тому +1

      Thank you ,
      This has been demanded so many times I guess I have to make a new video about it.

  • @peladoclaus
    @peladoclaus 3 місяці тому

    Whats better about this than google advanced search?

    • @redamarzouk
      @redamarzouk  2 місяці тому

      I don't see how they're similar.
      I'm not searching for anything, i'm giving an exact url from which I want to extract structured data using an LLM.

  • @abdopower5913
    @abdopower5913 4 місяці тому +2

    Are u Moroccan or Algerian ?😊

    • @moiguess3256
      @moiguess3256 3 місяці тому

      Moroccan, easy to find out.

  • @cineymatic
    @cineymatic 4 місяці тому +2

    Great video! I have a few questions though 🤔:
    - Would it be easy to extend it to first log in to a site and then start scraping?
    - Would it be able to click buttons and scrape data from subsequent pages?
    - How is it identifying the elements on the page? Should it always be under a category or in the form of a table?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      for the first 2 questions the answer is no, unless we're creating it for specific websites, otherwise we have to create a universal text-2-action module with it (which is infinitely harder to do )
      For the last question, as far as the element doesn't need a ui/ux action to show, the scraper will pick up on it.

    • @cineymatic
      @cineymatic 4 місяці тому

      @@redamarzouk Thank you for the response.

  • @ravendude3632
    @ravendude3632 16 днів тому

    Table aggregator 9000.
    Give him any obscure website with dynamic data. and It'll definitely fail.

    • @redamarzouk
      @redamarzouk  16 днів тому

      Can you share the website so I can test it as well?

  • @imsjs78
    @imsjs78 4 місяці тому

    sorry but where can I see the actual code? should I register any website?
    or is there any link?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      The project GitHub link is in the description.

    • @mertgokce6385
      @mertgokce6385 4 місяці тому +1

      @@redamarzouk Is there sth wrong with your github ? Because it is not accessible.

  • @Divyv520
    @Divyv520 4 місяці тому

    Hey Reda , really nice video ! I was wondering if I could help you with more Quality Editing in your videos and also make a highly engaging Thumbnail and also help you with the overall youtube strategy and growth ! Pls let me know what do you think ?

  • @YourFactoFactory
    @YourFactoFactory 3 місяці тому

    You can ask any AI to create a python script to scrape any web using selenium, Quick, easy and free

  • @obey24com
    @obey24com 4 місяці тому +1

    What about websites with cloudfare security etc.?

    • @TheLionsaba
      @TheLionsaba 4 місяці тому

      Very important question.

  • @daithi007
    @daithi007 4 місяці тому

    Do you have to manually accept cookies?

    • @redamarzouk
      @redamarzouk  4 місяці тому

      No I didn't need to do so for the websites I scraped

  • @cameronyking
    @cameronyking 4 місяці тому

    Can this be an API?