You've made 2024 my most productive year. Between your demos and cursor anything's possible. Thank you!
Thanks for doing the dirty work and doing a comprehensive comparison!
Wow, AgentQL is nuts!!
Man! I have been looking for your content but forgot your account name, and this finally came up! Last year I asked you what I needed to learn to get to your level coming from little programming experience, and you said to start with prompt engineering. I have since gotten certificates from Vanderbilt in prompt engineering and from Harvard Online in Python and Probability, and am working towards a data science. And I FINALLY can follow your stuff, except now I see JavaScript (doh!!!!!). Do I need to learn JavaScript, or is the JSON library in Python enough?
Also, your content is so dang good, I recommend adding some catchy tune or some fancy logo to remind people of your channel. So for branding, don’t do it RIGHT at the start, that’s always for the hook, but right after the hook maybe add a short 2-3 second jingle and a cheery “I AM JSON AI, let’s get started” or something catchy that reminds us of your channel each time.
Anyways, just a suggestion, keep putting out these awesome videos man!
Add jingles and I will unsub.
@@smthngsmthngsmthngdarkside Jingle bells jingle bells jingle all the way....
I guarantee you that if he adds a 1-2 second tune intro, marketing and putting his name out there for people to remember, but his content is still THIS good, you would NOT unsubscribe. Haha... if you would, then I have no words. lol
Alan Turing is Smiling in heaven
That's a nice comment, but it's based on a false premise that most people believe out of ignorance.
The 'dead know nothing', so they 'sleep in the grave' until the 2nd coming (time passes instantly when you sleep).
Then all the saved will rise into the air to meet Jesus at the same time (the 1st resurrection)
-- almost ('the dead in Christ will rise first, then the living').
Then after the saved have been in heaven for 1000 years, the 2nd resurrection happens -- all the lost.
They are judged and thrown into the lake of fire. The word is very clear on all this if you study.
There's more at the 1000 year mark, and 10,000 year mark, but I don't want to preach here.
@@ScottzPlaylists You know your Bible!! Thanks.
@@ScottzPlaylists Good to know... Thanks.
@@ScottzPlaylists Straight Truth -- I like it.
@@ScottzPlaylists It seems nicer to know that you go there together, and right now, they sleep.
They don't have to watch the horrors of this earth.
The truth is better than the lie. So spirits are de-mo-ns trying to deceive us. They can appear and speak, act, look, exactly like the dead. After all, they were present their whole life, trying to tempt and deceive.
The D's know us better than any human, plus they've had thousands of years of practice and observation.
Everyone has an Angel and a D assigned.
Great video. Do you recommend doing some kind of pattern replace on the markdown before it goes into the AI API, to get the character count down?
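For what it's worth, a pattern replace like that could look roughly like the sketch below. The helper name `compact_markdown` and the sample text are made up for illustration, and the rules are only one guess at what is safe to drop. In particular, it throws away link URLs, so skip that rule if the links are part of the data you need.

```python
import re

def compact_markdown(md: str) -> str:
    """Shrink scraped markdown before sending it to an LLM (hypothetical helper)."""
    # Drop images entirely: ![alt](url)
    md = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", md)
    # Keep the link text but drop the URL: [text](url) -> text
    md = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", md)
    # Strip trailing spaces and tabs at the end of each line
    md = re.sub(r"[ \t]+\n", "\n", md)
    # Collapse runs of three or more newlines into a single blank line
    md = re.sub(r"\n{3,}", "\n\n", md)
    return md.strip()

# Made-up sample of scraped markdown
sample = (
    "# Listing\n\n\n\n"
    "![photo](https://example.com/a.png)\n"
    "See [details](https://example.com/d?id=1)   \n"
)
compact = compact_markdown(sample)
print(len(sample), "->", len(compact), "characters")
```

Character count is only a rough proxy for tokens, but on link-heavy pages the URLs are usually where most of the savings come from.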
Do you know of any open source / locally hosted solutions that would achieve the same results as those APIs?
Firecrawl Self hosted + Gemini Flash free
Wow, AgentQL is impressive!
This video is gold!
Amazing. Thanks for sharing
We’ve just created an LLM based scraper
Using an LLM for this means that you are paying each time you scrape the data. Writing a script might have a larger upfront cost but should be cheaper long term. Sure, you might say that when the website is changed you will have to refactor your scraper, but I'd guess that you would have to do the same for your LLM-based scraper.
Imagine you need to scrape thousands of typical real estate websites every day.
LLM cost will be lower long term, unless you require absolutely huge scale
I've solved this with a self-maintaining crawler. It's been a bitch to do but I run it once a day on a small number of urls (scraping about 500k urls rn, 20 llm calls per maintenance) and it'll evaluate, update query selectors and even build new scripts.
you do not have to refactor your LLM scraper that much, it handles dynamic content very well and understands json super easily
@@sentry404.this on GitHub?
Great Video! Thanks for sharing :)
That sounds like the worst business case ever. Either incredibly slow or expensive.
you would be surprised how often businesses forget about these two statistics when it comes to seeing buzzwords like "AI"
In some cases you don't care because it runs 24:7 and it's cheaper than a human
People who say things like this know nothing about business
One important note: some AI scrapers use LLMs to, e.g., understand the shape of the data and from there build a mapper that maps a div id to a certain data model. For example, id=“address-city” to city. They don’t pass 10 GB of data to an LLM.
LLMs are good at finding the key navigation routes and mapping data.
And in the best scrapers I have seen, they’re oftentimes not writing code. They call code with the right parameters.
Websites are very repetitive. You can spend $5 and have all the information needed to scrape Craigslist. Once you do, you don’t need the LLM anymore.
However, in this video it does sound like people are shoving giant pieces of text into LLMs.
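To make the mapper idea concrete, here is a minimal sketch using only the Python standard library. It assumes the LLM has already produced the id-to-field mapping once; `FIELD_MAP`, the `IdMapper` class, the `extract` helper and the sample HTML are all hypothetical, and the parser is deliberately simplistic (it stops capturing at the first closing tag, so nested markup inside a field would need more care).

```python
from html.parser import HTMLParser

# Hypothetical mapping an LLM might produce once per site: element id -> model field
FIELD_MAP = {"address-city": "city", "address-zip": "zip_code"}

class IdMapper(HTMLParser):
    """Collect the text of elements whose id appears in the field map."""

    def __init__(self, field_map):
        super().__init__()
        self.field_map = field_map
        self.current = None  # field currently being captured, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.field_map:
            self.current = self.field_map[element_id]
            self.record[self.current] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.record[self.current] += data.strip()

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current capture
        self.current = None

def extract(html, field_map=FIELD_MAP):
    parser = IdMapper(field_map)
    parser.feed(html)
    return parser.record

page = '<div id="address-city">Sydney</div><div id="address-zip">2000</div><p>ignored</p>'
print(extract(page))
```

Once the mapping exists, every further page from the same site goes through `extract` with no LLM call at all, which is the point being made above.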
@@curiousspirit3947 Can you name / recommend some AI scrapers that do exactly that?
Good info! Thanks! Would really appreciate it if you slowed down a little bit
Slow the speed of the video down.
Jina is amazing! Will definitely use it.
Brilliant stuff!
How about using proxies for scraping jobs? Which of the mentioned tools have the best proxy pool integration?
Great video Thanks!
Does the cost justify it? AgentQL allows 15k API calls for $99 per month. That's not much
Claude "computer use" can't do this by itself now?
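Quick back-of-the-envelope on those numbers, assuming one API call per scraped page and a 30-day month (both are assumptions, not something the pricing quoted above states):

```python
def cost_per_call(monthly_price, calls_included):
    """Price of a single call when the whole monthly quota is used."""
    return monthly_price / calls_included

# Figures quoted in the comment above: $99 per month for 15k calls
per_call = cost_per_call(99, 15_000)

# Assumption: a 30-day month with the quota spread evenly across days
pages_per_day = 15_000 / 30

print(f"${per_call:.4f} per call, about {pages_per_day:.0f} pages per day")
```

So it works out to well under a cent per call, but only around 500 pages a day before the quota runs out.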
How do you guys feel about using Anthropic's Computer Use product to do web scraping?
It's currently very expensive and not reliable. One major issue with these visually-driven models is their vulnerability to prompt injection. As a website owner, you could add something like 'forget all previous instructions' to prevent scraping and maybe even have a little fun with it :)
What are the legalities of scraping? Are we able to provide a service that is taking data from another company like this, or do they just not care?
Historically LinkedIn has had some famous cases, but that's the only one I am aware of. Of course, now that we know for sure most AI models are based on scraping, we have other cases from that...
I'm pretty sure new agent systems could be considered malware, if not user directed 🤔
I think if a human can read it and take notes for free,
so should an AI on behalf of humans. ----- They just remember better if trained on it.
It's a bit of a grey area. But usually websites have a robots.txt file that outlines guidelines on scraping data.
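Checking those guidelines from a script is straightforward with Python's built-in `urllib.robotparser`. A minimal sketch follows; the robots.txt content, the `my-scraper` user agent and the example.com URLs are made up so that it runs offline.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content so the example runs without network access
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "my-scraper" is a hypothetical user agent name
allowed = parser.can_fetch("my-scraper", "https://example.com/listings/1")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
print(allowed, blocked)
```

Against a live site you would call `parser.set_url(...)` with the site's robots.txt URL and then `parser.read()` instead of `parse`. Note that robots.txt only states the site's crawling preferences; it doesn't settle the legal question by itself.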
HI, Jason. Can you please do a video on finetuning a vision model?
Insane🎉🎉🎉🎉 love it
I have a better idea. I'll give o1 the HTML of the page and result from firecrawl, and ask it to replicate the parsing function. This way I won't pay per page.
Isn't o1 like 25 requests a week?
This is insane😮😮😮😮
Can someone explain how creating an agentic solution for scraping is different than writing a playwright script? Since for AgentQL it seemed we were using the web elements and wrote a playwright script in the end, so confused what AgentQL is doing in that use-case...
Amazing stuff man ! learned a ton !
Amazing 🤩
Bro, what about Agent Zero? It can be used for scraping and getting information. And it does it very well
"Bro"
Would love to know how you would leverage the power of AI scraping on websites that use older tech like PHP or ASP
huh? that is on server end. scraping is on the front end.
You're based in Sydney Australia?
Using LLMs to scrape UI is horrifically inefficient lmao.
@AIJasonZ Jason, do you know Microsoft's OmniParser model? What do you think about building a scraping agent on top of it?
OpenAI is not an option due to restrictive terms of service. Do you know equivalent open source models for these tasks? Many thanks!
Is there a way to copy a website? That should be open-source?
Where would you deploy this in order to have recurring jobs?
Awesome stuff!
can we get the backend files?
An entry-level "expert" for 5-10 bucks an hour, and the first model shown was 4o. Sorry, thought it was funny
what about captchas?
Why not use browserless?
so at the end of the day all of these require python / some technical ability?
Yes but the barrier to entry is decreasing rapidly, as demonstrated by this video. The name of the game in 2025 will simply be, ideas, ideas, ideas!
Expensive and slow but just use the ref links
Have you seen one of your videos at 2x? 🐈
JIN YANG!
Bro just discovered robotic process automation 😅
He's a step ahead, he's trying to replace RPA
@ if you couldn't tell he is coding a bot... That is RPA
👍 RPA is when there is little to no AI involved... 👍
I like the new terms 'GUI Agent' best, then 'Computer using AI', then 'UI Agent', then 'Open Code Interpreter', then 'computer-use'
I guess the industry hasn't standardized on terms yet.
If it can be done without AI in the loop, it's much faster and cheaper.
RPA encompasses a lot more than Web Scraping, like web testing, etc.
Well, it's similar...
Wait till Cloudflare introduces a scrape-proof service that’s overpriced and inefficient…
.env is bad practice, especially in Python, more especially in VENV.
I wouldn't take anything any of these ai people do in their python projects as good practice.
Expensive
*Misleading title*
This works on dynamic JavaScript websites?
Sure does
So you are saying you are smarter than most companies using 50% of their eng resources to scrape correct data? I think you are dreaming) If you want to make sure you scrape 100% of the data, your approach is the worst.
In 99% of cases people just build a custom scrape script; these AI HTML-to-text solutions are not reliable if you need actual data
Yeah, if he can automate the writing of such a script that automatically compares against sample data and guarantees correct fetching of correct key value pairs, that would be interesting
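The "compare against sample data" part is the easy half. A rough sketch, where `generated_extractor` is a stand-in for whatever parsing function the model writes and `SAMPLES` is hand-labelled data (both hypothetical):

```python
import re

def generated_extractor(html):
    """Stand-in for an LLM-written parsing function (hypothetical)."""
    match = re.search(r'<span class="price">\$([\d,]+)</span>', html)
    return {"price": int(match.group(1).replace(",", ""))} if match else {}

def validate_extractor(extractor, samples):
    """Return (field, expected, got) for every mismatch on the samples."""
    failures = []
    for html, expected in samples:
        got = extractor(html)
        for field, value in expected.items():
            if got.get(field) != value:
                failures.append((field, value, got.get(field)))
    return failures

# Hand-labelled pages; the last one simulates a layout change
SAMPLES = [
    ('<span class="price">$1,200</span>', {"price": 1200}),
    ('<span class="price">$950</span>', {"price": 950}),
    ('<div class="cost">$700</div>', {"price": 700}),
]

failures = validate_extractor(generated_extractor, SAMPLES)
print(failures or "all samples match")
```

This can't guarantee correct fetching on pages it hasn't seen; it only flags when the script disagrees with the labelled samples, which is the signal for regenerating it.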
Can you create a video to do it using the LLM API or have a repo on it?
AgentQL is slow af
bro you go too fast
Am I racist or you sound like the guy from Silicon Valley?🙈
boosting AI, what if there is encryption
I was wondering if you could help me understand how to just extract my prompts and responses from, say, ChatGPT or Facebook Messenger, for instance. Just the chat tho.
🖤🔥
users of jina after this video 💹💹