Man, you did a great job. Wishing you more subs and views, this deserves more recognition
I appreciate that very much! 😁 Thanks, and best of luck with your projects!
I understand this is just an example, but with the site you're scraping being so predictable, why do you need to use an LLM to scrape it?
Hi! Great question. For most cases, just go with conventional scraping based on the website's HTML structure!
For this specific case, I picked the site because it's an open-source project, so I don't think it would be harmful to scrape it for a demonstration.
But mainly it's because many people can't scrape from the HTML structure since they don't know how to code, and they could benefit initially from just using an LLM.
The ideal scenario for using an LLM is combining it with crawling, because while crawling you'd be reaching for data with an unpredictable structure across different websites.
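For anyone wondering what "scraping through the website's structure" can look like in practice, here's a minimal sketch using only Python's standard library. The HTML snippet, the `item` class name, and the `ItemLinkParser` / `scrape_links` names are all made up for illustration; they are not from the video or from any real site.

```python
from html.parser import HTMLParser

# Hypothetical page snippet; a real site would have its own tags and class names.
SAMPLE_HTML = """
<ul>
  <li class="item"><a href="/docs/intro">Intro</a></li>
  <li class="item"><a href="/docs/setup">Setup</a></li>
  <li class="ad"><a href="/promo">Sponsored</a></li>
</ul>
"""


class ItemLinkParser(HTMLParser):
    """Collects (text, href) pairs for links inside <li class="item"> elements."""

    def __init__(self):
        super().__init__()
        self.in_item = False  # True while inside an <li class="item">
        self.href = None      # href of the link currently being read, if any
        self.text = []        # text fragments of the current link
        self.links = []       # collected (text, href) results

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self.in_item = attrs.get("class") == "item"
        elif tag == "a" and self.in_item:
            self.href = attrs.get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.links.append(("".join(self.text).strip(), self.href))
            self.href = None
        elif tag == "li":
            self.in_item = False


def scrape_links(html):
    """Return the item links found in the given HTML string."""
    parser = ItemLinkParser()
    parser.feed(html)
    return parser.links


print(scrape_links(SAMPLE_HTML))
```

This works well as long as the page keeps the same tags and class names. Feed it a page laid out differently and it simply finds nothing, which is roughly the problem you run into when crawling many sites, and where an LLM may be the easier option.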
Well, fuck yeah. ALL your data goes to Chinese servers and gets used later for training. It's cheaper to make people give you data for free than to build complicated workflows yourself :)
It wouldn't exactly be "your" data if it's scraped from a public website that the LLM might already have been trained on anyway.
But it's a good security point to take into consideration!