This is how I scrape 99% websites via LLM

COMMENTS • 120

  • @squibtechnologies
    @squibtechnologies 1 day ago

    You've made 2024 my most productive year. Between your demos and Cursor, anything's possible. Thank you!

  • @JJ-tr8cu
    @JJ-tr8cu 26 days ago +20

    Thanks for doing the dirty work and doing a comprehensive comparison!

  • @kryptobash9728
    @kryptobash9728 27 days ago +20

    wow agentQL is nuts!!

  • @TheAIBlueprint
    @TheAIBlueprint 9 days ago +3

    Man! I have been looking for your content but forgot your account name, this finally came up! Last year I asked you what I needed to learn to get to your level coming from little programming experience and you said start with prompt engineering. I have since gotten certificates from Vanderbilt in prompt engineering, a certificate from Harvard Online in Python and Probability, and working towards a data science. And I FINALLY can follow your stuff, except now I see JavaScript (doh!!!!!). Do I need to learn JavaScript or is the JSON library in Python enough?
    Also, your content is so dang good, I recommend adding some catchy tune, or some fancy logo to remind people of your channel. So for branding, Don’t do it RIGHT at the start, that’s always for the hook, but right after the hook adding maybe a short 2-3 second jingle, and a cheery “I AM JSON AI, let’s get started” or something catchy that reminds us of your channel each time.
    Anyways, just a suggestion, keep putting out these awesome videos man!

    • @smthngsmthngsmthngdarkside
      @smthngsmthngsmthngdarkside 6 days ago

      Add jingles and I will unsub.

    • @TheAIBlueprint
      @TheAIBlueprint 3 days ago

      @@smthngsmthngsmthngdarkside Jingle bells jingle bells jingle all the way....
      I guarantee you that if he adds a 1-2 second tune intro, marketing and putting his name out there for people to remember, but his content is still THIS good, you would NOT unsubscribe. Haha... if you would, then I have no words. lol

  • @NLPprompter
    @NLPprompter 26 days ago +21

    Alan Turing is Smiling in heaven

    • @ScottzPlaylists
      @ScottzPlaylists 26 days ago +6

      That's a nice comment, but it's based on a false premise that most falsely believe because of ignorance.
      The 'dead know nothing' so they 'sleep in the grave' until the 2nd coming. (time passes instantly when you sleep)
      then all the saved will rise into the air to meet Jesus at the same time (the 1st resurrection)
      -- almost ('the dead in christ will rise first, then the living')
      Then after the saved have been in heaven for 1000 years, the 2nd resurrection happens -- all the lost.
      they are judged and thrown into the lake of fire. The word is very clear on all this if you study.
      There's more at the 1000 year mark, and 10,000 year mark, but don't want to preach here.

    • @Python_Scott
      @Python_Scott 26 days ago +4

      @@ScottzPlaylists You know your Bible!! Thanks.

    • @AGIBreakout
      @AGIBreakout 26 days ago +3

      @@ScottzPlaylists Good to know... Thanks.

    • @NWONewsGod
      @NWONewsGod 26 days ago +2

      @@ScottzPlaylists Straight Truth -- I like it.

    • @NWONewsGod
      @NWONewsGod 26 days ago +1

      @@ScottzPlaylists It seems nicer to know that you go there together, and right now, they sleep.
      They don't have to watch the horrors of this earth.
      The truth is better than the lie. So spirits are de-mo-ns trying to deceive us. They can appear and speak, act, look, exactly like the dead. After all, they were present their whole life, trying to tempt, and deceive.
      The D's know us better than any human, plus they've had thousands of years of practice and observation.
      Everyone has an Angel and a D assigned.

  • @amzpro5734
    @amzpro5734 6 days ago

    Great video. Do you recommend doing some kind of pattern replace on the markdown before it goes into the AI API, to get the character count down?
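
A minimal sketch of the kind of pattern replace asked about here, assuming the scraped page is already markdown and that link URLs and images are not needed by the model. The function name `compact_markdown` and the specific patterns are illustrative choices, not something shown in the video:

```python
import re

def compact_markdown(md: str) -> str:
    """Shrink scraped markdown before sending it to an LLM API."""
    # Drop images entirely: ![alt](url)
    md = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", md)
    # Keep link text but drop the URL: [text](url) -> text
    md = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", md)
    # Trim trailing spaces and tabs on every line
    md = re.sub(r"[ \t]+$", "", md, flags=re.MULTILINE)
    # Collapse runs of three or more newlines into one blank line
    md = re.sub(r"\n{3,}", "\n\n", md)
    return md.strip()

sample = (
    "# Title\n\n\n\n"
    "![logo](https://example.com/l.png)\n"
    "See [docs](https://example.com/docs)   \n"
)
compact = compact_markdown(sample)
print(len(sample), "->", len(compact), "characters")
```

If the extraction task needs the links themselves (for pagination, say), the second substitution should be skipped.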

  • @preben01
    @preben01 21 days ago +13

    Do you know of any open source / locally hosted solutions that would achieve the same results as those APIs?

    • @TheWiredArts
      @TheWiredArts 2 days ago

      Firecrawl Self hosted + Gemini Flash free

  • @Techonsapevole
    @Techonsapevole 4 days ago

    Wow AgentQL is impressive!

  • @riztube
    @riztube 19 days ago +1

    This video is gold!

  • @davidwylie8491
    @davidwylie8491 26 days ago +4

    Amazing. Thanks for sharing

  • @Rakibrown111
    @Rakibrown111 7 days ago

    We’ve just created an LLM based scraper

  • @torreydev
    @torreydev 27 days ago +23

    Using an LLM for this means that you are paying each time you scrape the data. Writing a script might have a larger upfront cost but should be cheaper long term. Sure, you might say that when the website is changed you will have to refactor your scraper, but I'd guess that you would have to do the same for your LLM based scraper.

    • @AlexanderShelestov
      @AlexanderShelestov 26 days ago +3

      Imagine you need to scrape thousands of typical real estate websites every day.

    • @ashleigh3021
      @ashleigh3021 26 days ago +2

      LLM cost will be lower long term, unless you require absolutely huge scale

    • @sentry404.
      @sentry404. 26 days ago +7

      I've solved this with a self-maintaining crawler. It's been a bitch to do but I run it once a day on a small number of urls (scraping about 500k urls rn, 20 llm calls per maintenance) and it'll evaluate, update query selectors and even build new scripts.

    • @daylight8296
      @daylight8296 26 days ago

      you do not have to refactor your LLM scraper that much, it handles dynamic content very well and understands json super easily

    • @dylliedutch
      @dylliedutch 26 days ago

      @@sentry404. Is this on GitHub?

  • @benschipper
    @benschipper 8 days ago

    Great Video! Thanks for sharing :)

  • @BaldyMacbeard
    @BaldyMacbeard 27 days ago +59

    That sounds like the worst business case ever. Either incredibly slow or expensive.

    • @acters124
      @acters124 26 days ago +9

      you would be surprised how often businesses forget about these two statistics when it comes to seeing buzzwords like "AI"

    • @TheGuillotineKing
      @TheGuillotineKing 26 days ago +8

      In some cases you don't care because it runs 24/7 and it's cheaper than a human

    • @jaysonp9426
      @jaysonp9426 25 days ago +10

      People who say things like this know nothing about business

    • @curiousspirit3947
      @curiousspirit3947 22 days ago +10

      One important note: some AI scrapers use LLMs to, e.g., understand the shape of the data and from there build a mapper that maps a div id to a certain data model. For example, id="address-city" to city. They don't pass 10 GB of data to an LLM.
      LLMs are good at finding the key navigation routes and mapping data.
      And in the best scrapers I have seen, they are oftentimes not writing code. They call code with the right parameters.
      Websites are very repetitive. You can spend $5 and have all the information needed to scrape Craigslist. Once you do, you don't need the LLM anymore.
      However, in this video it does sound like people are shoving giant pieces of text into LLMs
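
The mapper idea in this comment could be sketched as follows, under stated assumptions: the `FIELD_MAP` dictionary stands in for whatever a one-off LLM call would produce for a site, and the element ids and field names are hypothetical. Once such a mapping exists, extraction needs only the standard library and no further LLM calls:

```python
from html.parser import HTMLParser

# Hypothetical mapping an LLM might produce once per site: element id -> field name
FIELD_MAP = {"address-city": "city", "listing-price": "price"}

class IdMapper(HTMLParser):
    """Collects the text of elements whose id appears in the mapping."""

    def __init__(self, field_map):
        super().__init__()
        self.field_map = field_map
        self.record = {}
        self._current = None  # field currently being captured, if any

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.field_map:
            self._current = self.field_map[element_id]
            self.record[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: stop capturing at the first closing tag,
        # so nested markup inside a mapped element is not supported.
        if self._current is not None:
            self.record[self._current] = self.record[self._current].strip()
            self._current = None

def extract(html, field_map=FIELD_MAP):
    parser = IdMapper(field_map)
    parser.feed(html)
    parser.close()
    return parser.record

page = '<div id="address-city"> Sydney </div><span id="listing-price">$500</span>'
print(extract(page))
```

When the site layout changes and fields come back empty, that is the signal to spend another LLM call regenerating the mapping.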

    • @MRX-ff4vy
      @MRX-ff4vy 16 days ago

      @@curiousspirit3947 Can you name / recommend some AI scrapers that do exactly that?

  • @therammync
    @therammync 26 days ago +1

    Good info! Thanks! Would really appreciate it if you slowed down a little bit

  • @WildGamerYoutube
    @WildGamerYoutube 14 days ago +2

    Jina is amazing! Will definitely use it.

  • @ankitrav
    @ankitrav 22 days ago

    Brilliant stuff!

  • @attilavass6935
    @attilavass6935 22 days ago +1

    How about using proxies for scraping jobs? Which of the mentioned tools have the best proxy pool integration?

  • @AlfredNutile
    @AlfredNutile 19 days ago +1

    Great video Thanks!

  • @HaiLeQuang
    @HaiLeQuang 26 days ago +2

    Is the cost justified? AgentQL allows 15k API calls for $99 per month. That's not much
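
A quick back-of-the-envelope check of the figures quoted in this comment ($99 per month for 15k API calls). It assumes one API call per scraped page and a 30-day month, which are simplifications and not vendor pricing details:

```python
def cost_per_call(monthly_price: float, calls_included: int) -> float:
    """Average price of one API call if the whole quota is used."""
    return monthly_price / calls_included

def pages_per_day(calls_included: int, days: int = 30) -> int:
    """Pages per day the quota covers, assuming one call per page."""
    return calls_included // days

print(f"${cost_per_call(99, 15_000):.4f} per call")
print(pages_per_day(15_000), "pages per day")
```

At these numbers the plan covers a few hundred pages a day, which frames the comparison with a hand-written scraper discussed elsewhere in the thread.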

  • @nitzanbegger6250
    @nitzanbegger6250 25 days ago +2

    Claude "computer use" can't do this by itself now?

  • @j2csharp
    @j2csharp 26 days ago +2

    How do you guys feel about using Anthropic's Computer Use product to do web scraping?

    • @Nadia-AIInsiders
      @Nadia-AIInsiders 26 days ago +4

      It's currently very expensive and not reliable. One major issue with these visually-driven models is their vulnerability to prompt injection. As a website owner, you could add something like 'forget all previous instructions' to prevent scraping and maybe even have a little fun with it :)

  • @Mike-ts3kg
    @Mike-ts3kg 26 days ago +4

    What are the legalities of scraping? Are we able to provide a service that is taking data from another company like this, or do they just not care?

    • @ExTorvo
      @ExTorvo 26 days ago

      Historically LinkedIn has some famous cases, but that's the only case I am aware of. Of course, now that we know for sure most AI models are based on scraping, we have other cases from that...

    • @xlretard
      @xlretard 26 days ago

      I'm pretty sure new agent systems could be considered malware, if not user directed 🤔

    • @Python_Scott
      @Python_Scott 26 days ago +4

      I think if a human can read it and take notes for free,
      SO should an AI on behalf of humans. ----- they just remember better if trained on it.

    • @neelabhsomani5129
      @neelabhsomani5129 21 days ago

      It's a bit of a grey area. But usually websites have a robots.txt file that outlines guidelines on scraping data.
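
For the robots.txt point, a small sketch using Python's standard `urllib.robotparser`. The rules, the user agent name `my-scraper`, and the example.com URLs are made up for illustration; a real scraper would load the site's actual file with `set_url(...)` and `read()` instead of parsing an inline string:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a given user agent may fetch a given URL
allowed_listing = parser.can_fetch("my-scraper", "https://example.com/listings")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
print(allowed_listing, allowed_private)
```

robots.txt only expresses the site owner's crawling preferences; it does not settle the legal questions raised in this thread.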

  • @13taras
    @13taras 23 days ago

    HI, Jason. Can you please do a video on finetuning a vision model?

  • @ex3aliber
    @ex3aliber 27 days ago

    Insane🎉🎉🎉🎉 love it

  • @moamber1
    @moamber1 18 days ago +4

    I have a better idea. I'll give o1 the HTML of the page and result from firecrawl, and ask it to replicate the parsing function. This way I won't pay per page.

  • @sribastavrajguru304
    @sribastavrajguru304 19 days ago

    This is insane😮😮😮😮

  • @ItachiUchiha-tu7ir
    @ItachiUchiha-tu7ir 9 hours ago

    Can someone explain how creating an agentic solution for scraping is different than writing a playwright script? Since for AgentQL it seemed we were using the web elements and wrote a playwright script in the end, so confused what AgentQL is doing in that use-case...

  • @passportmarc
    @passportmarc 25 days ago

    Amazing stuff man ! learned a ton !

  • @raymondaxyz
    @raymondaxyz 27 days ago

    Amazing 🤩

  • @knowledgwithMAB
    @knowledgwithMAB 14 days ago

    Bro, what about Agent Zero? It can be used for scraping and getting information. And it does it very well

  • @tiagoafonso2971
    @tiagoafonso2971 27 days ago

    Would love to know how you would leverage the power of AI scraping on websites that use older tech like PHP or ASP

    • @KJM3SMG
      @KJM3SMG 27 days ago +2

      huh? that is on server end. scraping is on the front end.

  • @Gome.o
    @Gome.o 24 days ago

    You're based in Sydney Australia?

  • @JohnMcclaned
    @JohnMcclaned 26 days ago +2

    Using llm's to scrape ui is horrifically inefficient lmao.

  • @NLPprompter
    @NLPprompter 26 days ago

    @AIJasonZ Jason, do you know Microsoft's OmniParser model? What do you think about building a scraping agent on top of it?

  • @nothingtoseehere5760
    @nothingtoseehere5760 1 day ago

    OpenAI is not an option due to restrictive terms of service. Do you know equivalent open source models for these tasks? Many thanks!

  • @vVulkan
    @vVulkan 11 days ago

    Is there a way to copy a website ? That should be open-source?

  • @fev4
    @fev4 21 days ago

    Where would you deploy this in order to have recurring jobs?

  • @SaadKhanAhmed
    @SaadKhanAhmed 27 days ago +1

    Awesome stuff!

  • @DenizAlbayrak-c3r
    @DenizAlbayrak-c3r 19 hours ago

    can we get the backend files?

  • @TheGreyMotion
    @TheGreyMotion 27 days ago +3

    an entry level "expert" for 5-10 bucks an hour and the first model shown was 4o. sorry, thought it was funny

  • @j.c-mtl1150
    @j.c-mtl1150 5 days ago

    what about captchas?

  • @edoardododoguzzi
    @edoardododoguzzi 14 days ago

    Why not use browserless?

  • @Y.AndreaRusso
    @Y.AndreaRusso 26 days ago +1

    so at the end of the day all of these require python / some technical ability?

    • @DYORNFA
      @DYORNFA 22 days ago

      Yes but the barrier to entry is decreasing rapidly, as demonstrated by this video. The name of the game in 2025 will simply be, ideas, ideas, ideas!

  • @brunonovais8801
    @brunonovais8801 11 days ago

    Expensive and slow but just use the ref links

  • @jobautomation
    @jobautomation 26 days ago +2

    Have you seen one of your videos at 2x? 🐈

  • @hfislwpa
    @hfislwpa 27 days ago +13

    Bro just discovered robotic process automation 😅

    • @khitabjaisinghani340
      @khitabjaisinghani340 26 days ago +2

      He's a step ahead, he's trying to replace RPA

    • @hfislwpa
      @hfislwpa 26 days ago +3

      @ if you couldn't tell he is coding a bot... That is RPA

    • @AI.24.7
      @AI.24.7 26 days ago +3

      👍 RPA is when there is little to no AI involved... 👍
      I like the new term 'GUI Agent' best, then 'Computer using AI', then 'UI Agent', then 'Open Code Interpreter', then 'computer-use'
      I guess the industry hasn't standardized on terms yet.
      If it can be done without AI in the loop, it's much faster and cheaper.
      RPA encompasses a lot more than Web Scraping, like web testing, etc.

    • @SailGoldExplore
      @SailGoldExplore 26 days ago +2

      Well, it's similar...

  • @jenjerx
    @jenjerx 11 days ago

    Wait till Cloudflare introduces a scrape-proof service that's overpriced and inefficient…

  • @marc-speaks
    @marc-speaks 12 days ago

    .env is bad practice, especially in Python, more especially in VENV.

    • @smthngsmthngsmthngdarkside
      @smthngsmthngsmthngdarkside 6 days ago

      I wouldn't take anything any of these AI people do in their Python projects as good practice.

  • @sharukhrahman7925
    @sharukhrahman7925 20 days ago

    Expensive

  • @arduinoguru7233
    @arduinoguru7233 11 days ago

    *Misleading title*

  • @gangs0846
    @gangs0846 27 days ago +1

    This works on dynamic JavaScript websites?

  • @ordinarygg
    @ordinarygg 27 days ago +5

    So you are saying you are smarter than most companies using 50% of their eng resources to scrape correct data? I think you are dreaming) If you want to make sure you scrape 100% of the data, your approach is the worst.
    In 99% of cases, guys just build a custom scrape script. These AI HTML-to-text solutions are not reliable if you need actual data

    • @leonsvideos
      @leonsvideos 26 days ago

      Yeah, if he can automate the writing of such a script that automatically compares against sample data and guarantees correct fetching of correct key value pairs, that would be interesting

  • @codelucky
    @codelucky 26 days ago

    Can you create a video to do it using the LLM API or have a repo on it?

  • @amandamate9117
    @amandamate9117 12 days ago +1

    AgentQL is slow af

  • @Andreatuzze
    @Andreatuzze 17 days ago

    bro you go too fast

  • @ConnectorIQ
    @ConnectorIQ 7 days ago

    Am I racist or you sound like the guy from Silicon Valley?🙈

  • @EconomistNewsletter
    @EconomistNewsletter 26 days ago

    boosting AI, what if there is encryption

  • @PyJu80
    @PyJu80 20 days ago

    I was wondering if you could help me understand how to extract just my prompts and responses from, say, ChatGPT or Facebook Messenger, for instance. Just the chat tho.

  • @soulspawn
    @soulspawn 21 days ago +1

    🖤🔥

  • @VaibhavShewale
    @VaibhavShewale 26 days ago +1

    users of Jina after this video 💹💹