I built my own Reddit API to beat Inflation. Web Scraping for data collection.

  • Published Jun 11, 2024
  • The only way for us cash strapped developers to make it in this economy!
    In this video, I decide to create my own version of the Reddit API for as cheap as possible (while still remaining cloud hosted). We look at how I gathered the data, how I built a simple yet affordable data pipeline, and finally a usage-based API which costs me pennies rather than hundreds of dollars.
    This video was sponsored by Bright Data. To sign up for Bright Data and get $15 credit to build your own web scrapers, use the following link: brdta.com/dreamsofcode
    You can find the source code for this project on GitHub at the link below
    github.com/dreamsofcode-io/re...
    Become a better developer in 4 minutes: bit.ly/45C7a29 👈
    Join this channel to get access to perks:
    / @dreamsofcode
    My socials:
    Discord: / discord
    Twitter: / dreamsofcode_io
    00:00 Intro
    01:42 Web Scraping
    06:58 Message Queue
    10:19 BrightData
    13:23 Deploy to AWS Lambda
    14:18 DynamoDB
    15:32 API
    18:05 Final Cost
  • Science & Technology
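The pipeline the video describes (scrape → queue → store → serve) can be sketched end to end with standard-library stand-ins. This is an illustrative Python sketch, not the video's actual Go/Node code: `queue.Queue` plays the role of SQS and a dict plays DynamoDB.

```python
import json
import queue

# Stand-ins for the cloud pieces: queue.Queue for SQS, a dict for DynamoDB.
sqs = queue.Queue()
dynamo = {}  # keyed by post id

def scrape():
    # In the real pipeline this comes from the scraper; here, a fixed sample.
    return [{"id": "abc1", "title": "First post", "subreddit": "golang"}]

def producer():
    for post in scrape():
        sqs.put(json.dumps(post))  # publish one message per post

def consumer():
    while not sqs.empty():
        post = json.loads(sqs.get())
        dynamo[post["id"]] = post  # "write" to the database

def api_get(post_id):
    # The API layer only ever reads from the store, never from the scraper.
    return dynamo.get(post_id)

producer()
consumer()
print(api_get("abc1")["title"])
```

The point of the shape is the decoupling: the scraper never touches the database directly, so the storage backend can change without touching the scraper.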

COMMENTS • 286

  • @dreamsofcode
    @dreamsofcode 9 months ago +40

    To get $15 credit for use with Brightdata to scrape your own APIs, visit: brdta.com/dreamsofcode

    • @meinkanal13378
      @meinkanal13378 7 months ago

      Just an FYI: it's not working anymore, only $5

    • @dreamsofcode
      @dreamsofcode 7 months ago

      @@meinkanal13378 inflation strikes again 😭
      Let me reach out. Thank you for letting me know

    • @PaulSebastianM
      @PaulSebastianM 6 months ago

      Be careful, web scraping is illegal in some countries.

  • @foobars3816
    @foobars3816 7 months ago +113

    This was never a technical limitation, it was a legal one.

    • @jgould30
      @jgould30 6 months ago +3

      uh, no. It's a financial one. The idea that companies are going to offer network and compute resources for the sheer amount of API calls made for free was always comical. It's sad that so many programmers and general public think this stuff is just free or a charity. No matter what you do, eventually these costs will catch up to the business and HAVE to be charged to people or else the service will just die.

    • @fizzcochito
      @fizzcochito 6 months ago

      @@jgould30 I am going to touch you without your consent

    • @Homiloko2
      @Homiloko2 5 months ago

      @@jgould30 Yep. People pretend webscraping is 'free', but it still costs the companies. The companies are willing to bear the cost of regular users browsing through pages, but a scraper browsing through the entire catalog is even more expensive for the company than if they just used the API. Scraping is definitely malicious.

    • @tabbytobias2167
      @tabbytobias2167 1 month ago

      @@jgould30 it costs a server less than a penny to serve 1000 requests.

    • @jameskim7565
      @jameskim7565 1 month ago

      @@tabbytobias2167 yes, but for a service the size of reddit, it can lead to hundreds of thousands of dollars in losses, due to the sheer volume of those requests.

  • @sivuyilemagutywa5286
    @sivuyilemagutywa5286 9 months ago +380

    The video was enjoyable, but it's important to acknowledge that sponsored content can introduce bias. One approach could be to make the entire video centered around the sponsor, or if you choose to feature the sponsor as you did, consider presenting alternative services similar to them. Your videos are consistently excellent, boasting high-quality production, a well-maintained pace, and crystal-clear explanations.

    • @aliengarden
      @aliengarden 7 months ago +10

      that was my exact thought, thanks for pointing it out.

    • @seanthesheep
      @seanthesheep 7 months ago +18

      when ChatGPT focuses more on the sponsor of the video than the video itself

    • @jaumsilveira
      @jaumsilveira 7 months ago +16

      Yeah, bro was talking about making everything as free as possible and then presents a service which is very expensive

    • @hqcart1
      @hqcart1 7 months ago +3

      What about captcha? He didn't mention that his sponsor can get around it, and even his code doesn't handle captcha.

    • @TheMacWindows
      @TheMacWindows 7 months ago +2

      @@hqcart1 Death by captcha and related services exist for that

  • @shadez221
    @shadez221 9 months ago +263

    For anyone planning to try this, use Puppeteer's headless mode so it doesn't open multiple browser windows (improves performance), and route it through a VPN set up on AWS to obfuscate.
    And be ready to have your IP blocked 😊
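Getting an IP blocked usually starts with tripping rate limits, so throttling your own requests helps before reaching for proxies. A minimal client-side token-bucket throttle (an illustrative Python sketch, not from the video's code) could look like:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2)
print([bucket.allow() for _ in range(4)])  # burst of 2 passes, the rest are throttled
```

The scraper would call `bucket.allow()` before each request and sleep when it returns `False`.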

    • @__sassan__
      @__sassan__ 9 months ago +1

      Even when using the VPN?

    • @tacokoneko
      @tacokoneko 9 months ago

      VPNs also have an IP, so when doing this, if they block you, you need an endless revolving door of new VPNs or proxies @@__sassan__

    • @tacokoneko
      @tacokoneko 9 months ago

      which is not that hard, because if you port scan the entire internet with some strategic guessing (downloading public datacenter IP ranges, scanning port 1080 for SOCKS5 proxies) you can find unsecured proxies for free, even some rare ones that work with SSL over SOCKS5

    • @tacokoneko
      @tacokoneko 9 months ago +2

      I asked someone if port scanning the internet to find proxies is illegal and they said no, so I think it's completely legal; they didn't put a password or any authentication, so they are allowing people to use it

    • @Dot_UwU
      @Dot_UwU 9 months ago

      @@__sassan__ if you send a ton of requests with the same IP, you'll get rate limited. Also most VPN ips are datacenter IPs which are almost always blocked.

  • @shishsquared
    @shishsquared 9 months ago +454

    Crowdsourcing idea for this to prevent IPs getting blocked: a browser that pays its users for using it. Developers write scripts to scrape data and pay to use the network of users. Users then get paid for using the web browser, which creates a private session, encrypted away from the user, runs the web scraping tasks, and sends the data back to the developer. Build it all on top of Chromium, and if done correctly, websites would have a very difficult time blocking based on IP addresses, activity, or fingerprinting, because it would be distributed across actual user IPs and actual user login times (the browser only runs when open). My only concern would be how to protect the users when malicious devs start doing illegal activities. You'd have to have very strong terms and conditions, have logging, and be able to trace requests back to devs. But then that opens a dev-privacy can of worms. Still, an interesting concept.

    • @phoneywheeze9959
      @phoneywheeze9959 9 months ago +347

      Botnet as a Service

    • @levifig
      @levifig 9 months ago +148

      You just described 99% of the “VPN” apps available for your mobile device… ;)

    • @MuhsinunChowdhury
      @MuhsinunChowdhury 8 months ago +11

      Wouldn't residential sneaker botting proxies be able to accomplish the same thing?

    • @mathisd
      @mathisd 8 months ago +3

      @@MuhsinunChowdhury These costs..

    • @ajnart_
      @ajnart_ 8 months ago

      Ahahahah you're not wrong, especially the free ones @@levifig

  • @IannoOfAlgodoo
    @IannoOfAlgodoo 9 months ago +60

    Curious how much you spend on Bright Data, as their product is like $20/GB and $0.10/hour

    • @GoldenretriverYT
      @GoldenretriverYT 7 months ago +17

      Yeah, it's expensive as heck. Also I am wondering how they claim to have 72 million residential IPs?
      I can only imagine them having spread malware which then gave them a botnet to work with, or, less likely, they offer people money in exchange for running a proxy.
      Edit: I looked it up; apparently they have an SDK which app developers can integrate, which gives users a choice between ads or allowing their connection to be used by BrightData as a proxy. That's where they (at least claim to) have the proxies from.

    • @tardistrailers
      @tardistrailers 7 months ago +8

      @@GoldenretriverYT It'd be insane to run a resold proxy on your personal IP, just to see no ads somewhere. Worst case you get your home raided by law enforcement, because someone did something highly illegal with it. But I wouldn't be surprised if less educated people still do this.

    • @OrangeYTT
      @OrangeYTT 7 months ago

      @@GoldenretriverYT 99% of "residential proxies" are just computers in a botnet.
      Hola (that free VPN) got in trouble a while back for making people who used their VPN join their botnet for this very reason!

  • @FunctionGermany
    @FunctionGermany 9 months ago +24

    new reddit probably uses an internal API you can pull from by fetching from the browser window. also note another user's comment about old reddit + cheerio (no browser needed).
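The old-reddit-plus-parser approach mentioned above can be done with nothing but the standard library. A sketch using Python's `html.parser`; the markup below is hand-made for illustration, and old.reddit's real class names and structure may differ:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside <a class="title"> links in a listing page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a" and ("class", "title") in attrs:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)
            self.in_title = False  # only capture the link's own text

# Hand-made sample in the general shape of a listing page.
sample = ('<div><a class="title" href="/r/go/1">Go 1.22 released</a>'
          '<a class="title" href="/r/go/2">Generics tips</a></div>')
p = TitleParser()
p.feed(sample)
print(p.titles)
```

Compared with driving a headless browser, this runs in milliseconds and needs no Chromium process at all, which is the commenters' point.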

    • @eoussama
      @eoussama 3 months ago

      He probably used Playwright just to have an excuse to shove the Bright Data sponsorship into the video, which I understand.

  • @conaticus
    @conaticus 9 months ago +61

    Really cool project idea! Loved it

  • @forresthopkinsa
    @forresthopkinsa 8 months ago +243

    This is an interesting idea but a really impractical approach. New Reddit is an SPA and you can just use the XHR endpoints to fetch the data raw. Don't bother with browser emulation and HTML parsing.
    Besides, the closure of the APIs was never about restricting access to a user like you're circumventing here. As you've acknowledged, that wouldn't really make sense on the Web. The API pricing is about charging for data farming and large-scale user interception. You can't accomplish either of these use cases by scraping; you'll get rate-limited very quickly.
    The only way around this is using Bright Data's borderline-illegal botnet, which seems like a pretty shady way to do business.

    • @tatianatub
      @tatianatub 7 months ago +83

      It's called hostile interoperability, and it's the consequence of fucking over developers; it's time we remind platform hosts why APIs were created in the first place

    • @mathgeniuszach
      @mathgeniuszach 7 months ago +17

      People will use their own embedded browsers and similar scraping methods will occur locally. It's basically the same as an extension modification of the site. People just browsing normally don't need botnets and access to all of reddit, they just want a better stinking interface.

    • @ArizeOW
      @ArizeOW 7 months ago

      @@tatianatub It's time to remind you, that Reddit doesn't belong to "us". It belongs to Reddit. And they can do whatever they want with it. If they don't want large applications like Apollo to scrape EVERY post, comment, upvote, downvote, user karma and such, there is nothing you can do about it. That's it. It's not that deep.

    • @DathCoco
      @DathCoco 7 months ago

      also if using oldreddit you can simply use jsdom to parse the data without needing to spin up a chromium

    • @x--.
      @x--. 7 months ago +7

      The internet is meant to be and should be open.
      That doesn't mean everything has to be free at-scale but fighting hostility to the _idea of an open internet_ is a good thing. You're free to put your content behind a paywall for everyone.

  • @the_cobfather
    @the_cobfather 7 months ago +5

    Why use an SQS queue to abstract the db writing interface? The solution that immediately comes to mind is to just make an abstract class.
    The point of SQS is to be able to handle crazy amounts of throughput (like, up to 30,000 messages per second), which isn't really what you're doing.

  • @wierdnes
    @wierdnes 7 months ago +28

    Great video. I liked the step-by-step thought process of getting the scraper to collect data. One major flaw in the cost analysis you presented was the absence of any cost for Bright Data. Checking the pricing myself, it looks like €20 per GB of data?

  • @Jana-se4kv
    @Jana-se4kv 7 months ago +2

    THANK YOU!
    Very helpful!

  • @takennmc
    @takennmc 9 months ago +55

    8 cents for 3 weeks damn this really makes reddit unreasonable

    • @rockshankar
      @rockshankar 9 months ago

      That does come with significant management overhead. The project is a simple way to get it working; once you dig deeper there are lots of problems. Lambda and DynamoDB are cheap based on the number of requests, but if you post your API endpoint in public, 1 million requests will be gone in seconds, and then using Lambda will make it more expensive than running your own server.
      If it were cheaper, someone else would have done it already.

  • @teamredstudio7012
    @teamredstudio7012 7 months ago +23

    I would do this in a different way. I would simply write a script in whatever language that has GET and POST functions, so you can call the main page first and then parse the data. Websites often already use APIs to fetch their content; use Fiddler Classic or some other proxy server to inspect which API the website uses. When the website loads more content after scrolling, it needs to fetch that data from somewhere. Simply reproduce this API by copying the authentication tokens from the headers and providing the required headers in the requests, then parse the response body and add it to some database. I would make it store everything, so if data needs to be fetched repeatedly it comes from the offline copy instead of wasting resources fetching and parsing again.
    I never automate browsers: if your browser can fetch the data, you can fetch it too without a frontend. You can also get the URL for loading more content from the raw main page, because the browser needs to know where to fetch from anyway, so it's definitely defined somewhere. It's super simple to scrape websites; you only need to know how to make requests and parse JSON and XML in your preferred language! Don't automate browsers, just fetch directly!

    • @unforgettable31
      @unforgettable31 7 months ago +6

      I come from a cracking background, and back in the day this is exactly what we would do. We would write GET/POST requests with token-grabbing methods and get the job done. We'd launch hundreds of threads, each connected to a different proxy, instead of a single web browser. Sometimes it was challenging for particular platforms because of cookies, but at the end of the day it was doable.

    • @rossimac
      @rossimac 7 months ago

      Websites that use recaptcha2 are ones that I've found that I need a browser to interact with. Ones that don't then yes, totally, inspect the network traffic and understand how your browser is creating the requests and then replicate them.

    • @S0L4RE
      @S0L4RE 7 months ago +4

      +1 it’s such a massive pet peeve of mine seeing people use selenium when it could just be achieved with requests.

    • @cheemzboi
      @cheemzboi 7 months ago +1

      @@unforgettable31 what about captchas then

    • @unforgettable31
      @unforgettable31 7 months ago

      @@cheemzboi Most platforms use captchas when they detect ongoing suspicious activity, which is avoided when using proxies.

  • @WarlordEnthusiast
    @WarlordEnthusiast 7 months ago +1

    I actually did something similar: we needed financial data for a project we were working on, and the APIs we found were very limiting and some were very expensive.
    We tried using one of the cheaper ones and it straight up did not work; it had downtime of sometimes hours, and when we contacted the company they basically told us it wasn't their problem.
    So I built a web scraper, hosted it on my server at home, and scraped all the forex data I needed from their website for free.

  • @DodaGarcia
    @DodaGarcia 7 months ago +9

    Decoupling the data persistence from the business logic is always a good idea, but using a queue service for that is bonkers. It removes none of the existing complexity, since you still eventually have to map the message payload to the database schema, and then introduces more complexity because you now have to keep track of one more service, the publishing code, the consuming code and the asynchronicity itself.
    Just use the repository pattern with an adapter for the chosen database, or an ORM like Prisma if you really don't expect the app to scale much.

    • @goofynose2520
      @goofynose2520 6 months ago

      Agreed. I swear 90% of queues I encounter are needless overcomplications

    • @ShaneZarechian
      @ShaneZarechian 3 months ago

      Someone fork this and make it non-ridiculous

  • @dancinglazer1628
    @dancinglazer1628 7 months ago +27

    Honestly, I think this infrastructure is too complicated for what it is doing. I don't really care about the sponsored bit, but I think it would have been better to simply create a lambda that writes directly to a database (assume a cacheFactory -> RedisCache | MongoCache | JsonCache) along with a "freshness" param. Due to the relative simplicity of the data, I think Redis would be a good candidate; then all the API would need to do is fetch the data based on the query param, something which can probably be achieved in a single file.
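The "freshness" parameter idea above is essentially a TTL cache: serve the stored value while it is young, re-scrape once it goes stale. A minimal illustrative sketch (a dict standing in for the Redis/Mongo/JSON backends the comment names):

```python
import time

class FreshCache:
    """Return cached values only while they are younger than `ttl` seconds."""
    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store = {}  # key -> (value, stored_at)

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # stale: evict so the caller re-scrapes
            return None
        return value

cache = FreshCache(ttl=0.05)
cache.put("r/golang", ["post1", "post2"])
print(cache.get("r/golang"))  # fresh: returns the list
time.sleep(0.06)
print(cache.get("r/golang"))  # stale: returns None
```

The API handler would check the cache first and only trigger a scrape on a `None`.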

    • @jp46614
      @jp46614 7 months ago +5

      Yeah I feel it's been quite overengineered with all this message queue and database/service stuff, this could be done fully locally realistically and at not much of a bigger cost since nowadays OSS databases and caching solutions are really efficient

    • @hqcart1
      @hqcart1 7 months ago +1

      he will need a 2-4GB ram VM to do that. AWS is expensive

    • @dancinglazer1628
      @dancinglazer1628 7 months ago +3

      @@hqcart1 He is deferring the scraping to the sponsored service anyway, but I think we could just fetch the HTML instead of running a headless browser

    • @dancinglazer1628
      @dancinglazer1628 7 months ago

      @@jp46614 This could be a single service on a docker image, run a cron scheduler that fetches and writes to a json file and have a server running that uses the json as a database

    • @hqcart1
      @hqcart1 7 months ago

      @@dancinglazer1628 Even if he uses a sponsored service, at some point you will get a captcha, and my point was that his code does not handle that. And about fetching HTML: no, it does not work for complex sites where the HTML or classes get rewritten by JS. I tried that and failed, and ended up using a headless browser.

  • @scaffus
    @scaffus 9 months ago +1

    Great vid! Love your work

  • @EarlZMoade
    @EarlZMoade 9 months ago +5

    Unrelated to this video - would you show how you version your dotfiles (if you do)? It would make for a good video.

  • @sumirandahal76
    @sumirandahal76 9 months ago

    Quality project ❤ content worth watching, hooks through the time. 🎉

  • @kale_bhai
    @kale_bhai 7 months ago

    Learned about the queueing system utilization, but that's pretty much the only thing new to me.

  • @louishuort7969
    @louishuort7969 8 months ago +5

    What about the cost of Bright Data?

  • @primo_geniture
    @primo_geniture 9 months ago +6

    I'm curious as to what the total time for the project was.

  • @xXtim128Xx
    @xXtim128Xx 7 months ago +3

    Using a full webbrowser when a simple HTTP request and HTML parser would suffice...

    • @dreamsofcode
      @dreamsofcode 7 months ago +1

      You're correct. It would have. However a browser is a more versatile option for other use cases.

  • @-Siknakaliux-II
    @-Siknakaliux-II 7 months ago

    So this vid popped up in my recs. Unrelated off-topic comment, but I remember getting into a programming phase in grades 6-7. I pretty much obsessed over the thought of doing something great with it. Got myself to do a few courses but never really stuck with it, as I've moved on to finance. Now I kinda wanna get into it again like I did in the past...

  • @TheHotMrDuck
    @TheHotMrDuck 7 months ago +5

    I hope this doesn't kill old reddit; if they remove it, I'm gone

  • @nigerianprince5389
    @nigerianprince5389 5 months ago

    First off, thanks for this, buddy, you're a godsend.
    It does feel a bit over-engineered, but I guess you've gone this route because you want to build your own Reddit API.
    For folks like me who have only been coding every day for 1 month using GPT, knowing how to pull the data from reddit and store it in a database is the main thing I need (I think for most people as well, but I could be wrong).
    Keep up the good work and thank you again!

  • @antonjoacir
    @antonjoacir 8 months ago +2

    Man, could you make a video about your terminal configuration?

  • @socks5proxy
    @socks5proxy 7 months ago

    absolutely brilliant video. so very well done.

    • @dreamsofcode
      @dreamsofcode 7 months ago

      Thank you! I'm glad you enjoyed it!

  • @jerryaugusto95
    @jerryaugusto95 9 months ago +1

    Is it just me or are the icons for the Go files different? How do you change these icons please?

  • @poggybitz513
    @poggybitz513 7 months ago +1

    I did the same thing for my app using Selenium bindings in Rust, and used Vagrant to manage instances. You can use Docker if you want. Please mark this video as an ad, because no one in their right mind would do it this way. I am so tired of people shoving ads down my throat and claiming it's good education.

  • @veshal.s3690
    @veshal.s3690 8 months ago +1

    Would love a post on your powerlevel10k config and your terminal config

  • @chofmann
    @chofmann 9 months ago +5

    You are aware of the JSON API that apps like RIF use? Basically, for every link there is also a JSON document you can just access.
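That JSON endpoint returns a listing document, so no HTML parsing is needed at all. A sketch against a trimmed, hand-made sample in the same general shape (real responses come from appending `.json` to a reddit URL, which this sketch does not fetch):

```python
import json

# Hand-made sample in the shape of a reddit-style listing document.
# A real response would come from a URL such as old.reddit.com/r/<sub>/new/.json.
raw = """{
  "data": {"children": [
    {"data": {"id": "x1", "title": "Show HN clone", "ups": 42}},
    {"data": {"id": "x2", "title": "Weekly thread", "ups": 7}}
  ]}
}"""

listing = json.loads(raw)
posts = [child["data"] for child in listing["data"]["children"]]
titles = [p["title"] for p in posts]
print(titles)
```

Each post's fields live under a nested `data` key, which is the only non-obvious part of the shape.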

  • @ltecheroffical
    @ltecheroffical 1 month ago

    You can remove the browser part by using a web scraping framework that works without a browser instance.

  • @glitchy_weasel
    @glitchy_weasel 7 months ago

    Fantastic! Very informative, always nice to stick it to big tech lol

  • @sworatex1683
    @sworatex1683 7 months ago +2

    Why didn't you use curl? It would be way more lightweight than using a browser. Most programming languages let you manage DOM objects with built-in libraries.

  • @christianjedro6206
    @christianjedro6206 7 months ago +1

    How do you avoid vendor/database lock in by using AWS SQS?!

  • @5criptcom
    @5criptcom 7 months ago

    Good one sir!

  • @jondoe79
    @jondoe79 9 months ago +2

    Great content, real examples of use case for different tools for a simple but useful project.

  • @pelic9608
    @pelic9608 7 months ago +3

    Every modern website has an API.
    Most just aren't documented. 🤷‍♂️
    Copy their own website's auth flow and use those tokens to drive your app. What are they gonna do? Paywall their entire site?
    (Ok, ok; SSR is a thing, but there's still almost always some pure-data endpoint around)

  • @k98killer
    @k98killer 8 months ago +3

    Would it have cost more without the brightdata sponsorship?

    • @louishuort7969
      @louishuort7969 8 months ago +2

      Ohh yes, a lot, bright data is very expensive

  • @cooperqmarshall
    @cooperqmarshall 9 months ago +1

    The quality of this project is supreme tier. Love the detail and consideration for the infrastructure

  • @zack_beard
    @zack_beard 6 months ago

    Great content! Quick question: did you do this after logging in to Reddit with your userid/pwd, or without? IIRC Reddit does not show new content if you are not logged in. Thanks!

    • @dreamsofcode
      @dreamsofcode 6 months ago

      Thank you!
      Logged out, which means it falls under publicly accessible. Reddit still shows content on the old reddit website under /new when you're not logged in.

  • @creeperlolthetrouble
    @creeperlolthetrouble 7 months ago +1

    xD i've seen this coming for months but why not keep AWS and tunnel the requests through a proxy

  • @rando521
    @rando521 9 months ago +2

    hi dreams, i love your vids on vim and tried it on my own because of them.
    While trying C++, I want to know: is there a better option than CMake?
    I come from Python, so I plan on RPC-ing the Python part and moving to mostly C++ or Golang. Any ideas on how to do this?

    • @FaZekiller-qe3uf
      @FaZekiller-qe3uf 9 months ago +2

      The better option is to use a language with good tooling. Zig, Rust, Go, etc. cmake L, Make L.

    • @jacksonsmith4648
      @jacksonsmith4648 9 months ago

      Meson! It's basically CMake, but with syntax similar to python, and a lot less stupid design decisions. Definitely worth a look.

    • @S0L4RE
      @S0L4RE 7 months ago

      @@jacksonsmith4648 why are we hating on cmake?

  • @shadyworld1
    @shadyworld1 7 months ago +1

    If you could use RSS to pull the data and store it in a proper format to be used for the API, you'd save at least 40% of your current approach's time and effort!
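Pulling from a feed is just XML parsing. A sketch with Python's `xml.etree` against a hand-made RSS 2.0 sample; note that reddit actually serves Atom feeds (appending `.rss` to a URL), so the real element names and namespaces differ:

```python
import xml.etree.ElementTree as ET

# Hand-made RSS 2.0 sample; reddit's real feeds are Atom, so tags differ.
rss = """<rss version="2.0"><channel>
  <title>r/programming</title>
  <item><title>Post one</title><link>https://example.com/1</link></item>
  <item><title>Post two</title><link>https://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
# Walk every <item> and pull out its title and link text.
items = [(i.findtext("title"), i.findtext("link")) for i in root.iter("item")]
print([title for title, _ in items])
```

Feeds only expose recent entries, which is why they suit a "poll /new periodically" pipeline rather than full-history scraping.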

  • @jasontruter7239
    @jasontruter7239 7 months ago

    Good job. One improvement would be to go with a single-table design in DynamoDB.

  • @jakestrouse12
    @jakestrouse12 7 months ago +11

    You can also reverse engineer their private api by looking at the browser network requests. The scraping will be much faster

    • @S0L4RE
      @S0L4RE 7 months ago

      Although Cloudflare IUAM makes it an immense pain in the ass

    • @batmanatkinson1188
      @batmanatkinson1188 7 months ago

      And keep in mind that private APIs are susceptible to change, so today it’s gonna work, tomorrow you have to start over

    • @unaif.2171
      @unaif.2171 7 months ago

      @@batmanatkinson1188 less often than the HTML

    • @TheSaintsVEVO
      @TheSaintsVEVO 7 months ago

      @@S0L4RE what's that? Does Reddit use it?

    • @S0L4RE
      @S0L4RE 7 months ago

      @@TheSaintsVEVO I’m not sure if Reddit uses it, but IUAM detects very low-level characteristics about the request (i.e cipher mode, SSL configuration) to determine whether it looks automated.

  • @dandandev
    @dandandev 9 months ago +1

    Heya! I'd recommend Railway to host your apps; it's usage-based and pretty cheap!

  • @filiprandom
    @filiprandom 2 months ago

    I watched this video for 4 hours because it was on repeat and I fell asleep

  • @grif5307
    @grif5307 9 months ago

    One of my favourite videos in a while, great job!!!!

  • @EarlZMoade
    @EarlZMoade 9 months ago +3

    Are there any issues with legality when using the data you extract? I.e. could you use the data for commercial purposes, or research?

    • @ristekostadinov2820
      @ristekostadinov2820 9 months ago +7

      Microsoft, I think, has taken someone to court for web scraping and won. I think it was a company that was scraping LinkedIn users' public data and building their own app for recruiting people, and Microsoft argued that the users didn't consent to that (which is true, but then again the data is public). So it's a very tricky problem, and it's best to read the website's terms of service.

  • @stylrart
    @stylrart 7 months ago

    Nice, you're using JB Mono like me.
    What theme are you using? The colors are handsome ;)

  • @heckerhecker8246
    @heckerhecker8246 7 months ago +1

    How to get four hitmen at your door:

  • @sheldonsays9922
    @sheldonsays9922 6 months ago

    How long did it actually take you to complete this project?

  • @pchris
    @pchris 7 months ago +2

    Would something like this work for third-party applications like Reddit Apollo?

    • @CrazyWinner357
      @CrazyWinner357 7 months ago

      It can work... until you get a captcha

  • @Meleeman011
    @Meleeman011 5 months ago

    why do you use playwright and not just puppeteer?

  • @JoshIbbotson-
    @JoshIbbotson- 7 months ago

    How long have you been programming? Loved this video btw!

    • @dreamsofcode
      @dreamsofcode 7 months ago

      Thank you! I've been writing code since 2008.

  • @techwithjoe8636
    @techwithjoe8636 7 months ago

    Which Editor is he using? Vim?

  • @TrueDetectivePikachu
    @TrueDetectivePikachu 7 months ago

    Genuine question, why use puppeteer that relies on an active browser and not something like cheerio?

    • @dreamsofcode
      @dreamsofcode 7 months ago +1

      It's a great question. Cheerio would work really well in this case, as there was little to no JavaScript on the old version of reddit. Initially I wanted to go with the new reddit, so I had scoped out using an active browser (which I think has more application beyond reddit). Cheerio is always preferable in a case with no JavaScript, but it's not as broadly applicable as Puppeteer. TL;DR: I wanted to showcase active-browser scraping in the video.

  • @ahwx
    @ahwx 9 months ago +1

    I see you're using a Mac now, what terminal is that? How are your rounded window corners so much less rounded than mine? Have you changed anything?

  • @siniarskimar
    @siniarskimar 9 months ago

    How about developing a browser extension for "enhancing" reddit that would additionally scrape any post the user sees 🤔

  • @houstonbova3136
    @houstonbova3136 8 months ago

    DataStore and FireStore work roughly the same as Dynamo, no?

  • @vekoze9872
    @vekoze9872 7 months ago

    What is the tmux font?

  • @iamrafiqulislam
    @iamrafiqulislam 7 months ago

    What font are you using for Nvim and the tmux status bar, please?

    • @dreamsofcode
      @dreamsofcode 7 months ago +1

      I am using JetBrainsMono Nerd Font! I have a video on both of my Nvim and tmux configs on my channel :)

  • @betapacket
    @betapacket 7 months ago

    2:02 isn't Playwright yet another ECM and not a web scraper?

  • @edanbigw
    @edanbigw 8 months ago

    Sorry, off topic: are you using a Mac, sir?

  • @metalspoon69
    @metalspoon69 9 months ago +16

    "Just build your own API"
    *builds own API*
    "NOO NOT LIKE THAT!!!!"

  • @mx338
    @mx338 6 months ago

    DynamoDB isn't really low cost, so I would definitely look into switching to ScyllaDB which offers a DynamoDB compatible API.

  • @navaneeth6157
    @navaneeth6157 9 months ago

    chromedp for golang is also an option

  • @willmil1199
    @willmil1199 5 months ago

    How do we use your api then ?

  • @qCJLbggG4IWAY9nTH6o
    @qCJLbggG4IWAY9nTH6o 8 months ago

    why not use their rss feed?

  • @Shudshudu
    @Shudshudu 7 months ago

    Sir, I am learning C and am new to programming. Currently I am learning control structures, but when I look into real-world projects I don't understand anything. Why?

    • @user-hy6cp6xp9f
      @user-hy6cp6xp9f 7 months ago +1

      It takes time! Also C is a VERY different level of abstraction than Javascript / Go like he used here.

  • @user-nr1qk6oi7g
    @user-nr1qk6oi7g 7 months ago

    if you used python you could easily bypass ip blocking with torpy

  • @Puwunda
    @Puwunda 7 months ago

    Intercontinental Lawsuit Inbound!!!

  • @guillemgarcia3630
    @guillemgarcia3630 9 months ago +2

    jesus there's more terraform configuration than code

  • @hemant_san
    @hemant_san 7 months ago

    how to bypass captcha?

  • @robinbinder8658
    @robinbinder8658 7 months ago

    boi do i smell a cease and desist

  • @Dev-Siri
    @Dev-Siri 9 months ago +6

    tip: Bun 1.0 was released just yesterday, and you can use it as a drop-in replacement for Node.
    It executes JS much faster without breaking anything, so it can magically make your API faster. For deployment, you need to use a Docker image because it's still very early and not supported by most platforms (yet).

    • @ac130kz
      @ac130kz 9 months ago

      it just gets stuck if I try to run puppeteer with whatsapp-web.js. Yeah, fast and cool, but too early.

  • @_Mackan
    @_Mackan 8 months ago

    virgin api consumer vs chad scraper

  • @reihanboo
    @reihanboo 7 months ago

    didn't understand anything but great video

  • @flor.7797
    @flor.7797 6 months ago

    There’s no AI without API

  • @dimagass7801
    @dimagass7801 7 months ago

    I have no clue how to use APIs, I still don't completely understand, but data is the new oil 😅

  • @ultimatetoast2739
    @ultimatetoast2739 7 months ago

    Apicels be seething over scrapechads

  • @Rundik
    @Rundik 9 months ago +6

    You don't need any browser to scrape HTML from reddit. How did you even manage to configure vim with that skill level?

    • @night23412
      @night23412 9 months ago +1

      what about pressing the next button, don't you need a browser emulator for that?

    • @Rundik
      @Rundik 9 months ago

      @@night23412 Unless you need to take a screenshot, or you don't have much experience/time, using puppeteer-like tools is extremely wasteful. And for simple text scraping you don't need that much experience at all.

  • @mr.togrul--9383
    @mr.togrul--9383 9 months ago +3

    Great video btw! In the future I also want to make my own web scraper project, and this just simplified everything I need to do.
    Is there any reason why you didn't just use Golang for the whole thing, for the scraper as well? Just curious, since as you said, writing Golang would be faster than Node.js.

    • @JeanHirtz-ms3bf
      @JeanHirtz-ms3bf 9 months ago

      Curious about Golang - any repo / vids ?

  • @VRGamerBoi
    @VRGamerBoi 7 months ago

    Chatgpt told me about this

  • @TheArchimede2000
    @TheArchimede2000 7 months ago

    he never disappoints

  • @_soundwave_
    @_soundwave_ 5 months ago

    A very interesting comment section.

  • @mayar2047
    @mayar2047 9 months ago

    I'm thinking of just scraping reddit directly from a mobile device, and maybe saving the data to the device for caching. I don't need to pay for anything

  • @hqcart1
    @hqcart1 7 months ago

    what about captcha?

  • @pixel690
    @pixel690 7 months ago +1

    $20 per GB is something different jesus

  • @lowlevell0ser25
    @lowlevell0ser25 9 months ago

    They will block things like this with Web Environment Integrity

  • @mikaay4269
    @mikaay4269 7 months ago

    Application
    Paying
    Interface

  • @lollermann
    @lollermann 6 months ago

    Don't let pyrocynical see this video he'll become a web dev

  • @bieggerm
    @bieggerm 8 months ago

    This video shows the only way an arms race should be visualized

  • @DaMu24
    @DaMu24 5 months ago

    Ok, give it to me

  • @xybersurfer
    @xybersurfer 9 months ago +2

    I was with you until you started putting things in a database and the cloud. Was it because your video was sponsored by a cloud provider? (I really can't tell.) It would be more interesting to see you justify decisions; seeing all the code is really not that interesting. The overall idea of creating your own reddit API is interesting though, so I will give this a like.

  • @iliabeliaev2260
    @iliabeliaev2260 7 months ago

    Old reddit is the only version I use...

  • @itsjohannawren
    @itsjohannawren 7 months ago

    Application Profit Initiative