How
- Published 23 Jul 2024
- System Design for SDE-2 and above: arpitbhayani.me/masterclass
System Design for Beginners: arpitbhayani.me/sys-design
Redis Internals: arpitbhayani.me/redis
Build Your Own Redis / DNS / BitTorrent / SQLite - with CodeCrafters.
Sign up and get 40% off - app.codecrafters.io/join?via=...
In the video, I discussed how Twitter maintains search at scale using Elasticsearch. Twitter built tooling around Elasticsearch to handle surges in search traffic, real-time ingestion, and backfill. The system design course focuses on building intuition and covers real-world design scenarios. Twitter's tooling includes an Elasticsearch proxy for standardization and a backfill service for staggered data ingestion. By deferring writes and keeping reads synchronous, Twitter maintains stability and scalability in its Elasticsearch clusters.
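The deferred-write, synchronous-read pattern described above can be sketched roughly as follows. This is a toy illustration, not Twitter's actual code: the in-memory queue stands in for a Kafka topic, the dict stands in for an Elasticsearch index, and all class and method names are made up.

```python
from collections import deque

class SearchProxy:
    """Illustrative proxy: writes are deferred to a queue (standing in
    for Kafka), while reads go synchronously to the search index."""

    def __init__(self):
        self.write_queue = deque()   # stand-in for a Kafka topic
        self.index = {}              # stand-in for an Elasticsearch index

    def write(self, doc_id, doc):
        # Deferred write: enqueue and acknowledge immediately.
        self.write_queue.append((doc_id, doc))
        return "accepted"

    def read(self, doc_id):
        # Synchronous read: hit the index directly.
        return self.index.get(doc_id)

    def drain(self):
        # A downstream worker applies queued writes to the index.
        while self.write_queue:
            doc_id, doc = self.write_queue.popleft()
            self.index[doc_id] = doc

proxy = SearchProxy()
print(proxy.write("t1", {"text": "hello"}))  # "accepted" immediately
print(proxy.read("t1"))                      # None: write not yet applied
proxy.drain()                                # worker catches up
print(proxy.read("t1"))                      # {'text': 'hello'}
```

The point of the sketch is the asymmetry: a write surge only grows the queue (Kafka absorbs it), while the cluster keeps serving reads at its own pace.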
Recommended videos and playlists
If you liked this video, you will find the following videos and playlists helpful
System Design: • PostgreSQL connection ...
Designing Microservices: • Advantages of adopting...
Database Engineering: • How nested loop, hash,...
Concurrency In-depth: • How to write efficient...
Research paper dissections: • The Google File System...
Outage Dissections: • Dissecting GitHub Outa...
Hash Table Internals: • Internal Structure of ...
Bittorrent Internals: • Introduction to BitTor...
Things you will find amusing
Knowledge Base: arpitbhayani.me/knowledge-base
Bookshelf: arpitbhayani.me/bookshelf
Papershelf: arpitbhayani.me/papershelf
Other socials
I keep writing and sharing my practical experience and learnings every day, so if you resonate, follow along. I keep it no-fluff.
LinkedIn: / arpitbhayani
Twitter: / arpit_bhayani
Weekly Newsletter: arpit.substack.com
Thank you for watching and supporting! It means a ton.
I am on a mission to bring out the best engineering stories from around the world and make you all fall in love with engineering. If you resonate with this, then follow along; I always keep it no-fluff.
Thanks Arpit, this helps in drawing parallels to other systems as well. And it's so nice to see that the fundamentals are quite the same in handling large-scale infra.
Love these stories of great engineering. Please bring them more often. Thanks a lot 🙂
Such a great explanation, learned a lot from you Arpit sir 😎, keep going 🔥
Very helpful! Thanks a lot sir.
Thanks Arpit for such informative content.
Similar to what we built at Oracle... Oracle Knowledge AI search has a similar kind of architecture. We have also introduced vector search in Elasticsearch.
In database systems we can segregate writes and reads across different DB nodes and eventually make the read node consistent with the write node's data. I don't know much about ES, but was that not an option in ES?
Thanks Arpit! This was a great video! I had a question.
In the backfill process, how does the orchestrator know how many workers to spawn? How do you monitor and calculate the amount of data yet to be processed in HDFS?
If Kafka were used instead of HDFS, I know there's a way to calculate the consumer lag, which can be used to trigger the orchestrator's rules.
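For reference, the consumer lag mentioned above is just the sum, over partitions, of the log-end offset minus the consumer group's committed offset. A minimal sketch of that calculation with hypothetical offset numbers (in practice you would fetch these from Kafka itself, e.g. via an admin client):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag = sum over partitions of (log-end offset - committed offset).
    Both arguments map partition id -> offset; a missing committed offset
    is treated as 0 (nothing consumed yet)."""
    return sum(
        end_offsets[p] - committed_offsets.get(p, 0)
        for p in end_offsets
    )

# Hypothetical numbers: three partitions of a backfill topic.
end = {0: 1000, 1: 1200, 2: 900}
committed = {0: 800, 1: 1200, 2: 450}
print(consumer_lag(end, committed))  # 650
```

An orchestrator could scale workers up when this number crosses a threshold and scale down when it approaches zero.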
This was really a crisp one.
Very nice!
But I guess if the read operation is I/O-intensive, like fetching a yearly orders report from ES, it shouldn't be a synchronous operation; rather it should follow the write flow you described, i.e., send the report's meta details as an event to Kafka topics and have workers mail the reports asynchronously later.
If the report is big, how we fetch it could also be discussed here.
But why use ES for analytical queries?
Better to directly run a Spark job on S3/HDFS and refrain from using Elasticsearch for such use cases.
Why was HDFS used here? A simple queue (like SQS), or Kafka if Twitter wanted a retry mechanism, would have achieved the same.
Staging storage for subsequent consumption.
@@AsliEngineering Thanks for the reply! I was not expecting to get a reply here.
When backfill is not required, Twitter puts the data into Elasticsearch directly, and for backfill they put it into HDFS. I think the reason would be the storage constraints of Kafka or SQS; S3 or HDFS do not have that.
Thanks for the great explanation. I have a basic question: what is backfill, and what is its job here?
Is it about parsing each tweet and doing analysis?
Backfilling updates the index with the latest data crawled from various sources.
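To make the backfill idea concrete: the staged data is replayed into the index in fixed-size batches rather than all at once, so the cluster is not overwhelmed. A toy sketch of the batching step (names and batch size are illustrative; the list of strings stands in for records staged in HDFS):

```python
def staggered_batches(records, batch_size):
    """Split staged records into fixed-size batches. A backfill worker
    would bulk-index one batch, throttle, then take the next."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

staged = [f"tweet-{i}" for i in range(10)]   # stand-in for data staged in HDFS
batches = list(staggered_batches(staged, 4))
print(len(batches))   # 3 batches: sizes 4, 4, 2
print(batches[-1])    # ['tweet-8', 'tweet-9']
```

The throttling between batches is what makes the ingestion "staggered": the rate is bounded by what the cluster can absorb, not by how much data is waiting.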
Hey Arpit, I am confused. Initially you said every team had their own cluster. Is the proxy a common service for all the clients of the different clusters, or will each cluster have its own proxy service?
A hybrid setup is a possibility.
There may be services that have an isolated proxy, while a few others share one.
Great video dude.
I wanted to ask you about your recording setup. Are you using OBS and screen-mirroring your iPad, or something else? Please mention any hardware/software you use for these videos.
OBS plus iPad. Nothing more.
@@AsliEngineering I see. So is it the iPad that you screen-mirror, plus OBS on a MacBook? And is the app Notability? Btw your handwriting is awesome!!
Hi Arpit, thanks for your videos. Sorry if my question is stupid. I have seen this video and your BookMyShow video as well; in both, scaling always happens during write operations only. What about when huge traffic reads a particular API? How is API stability ensured? Kindly revert, please.
Replicas and Caching.
@@AsliEngineering Thank you
Any HighLevel folks watching this?
It would be very similar to our eventing (& mongo-indexing) service, and the backfill is basically our snapshot service.
Since the write happens asynchronously, that particular tweet wouldn't reflect in his tweets immediately, right? So how will the user immediately see his tweet?
How likely is the user to search for his/her own tweet immediately after posting it?
@@AsliEngineering How would we handle such a use case, if there is one?
@@user-ot3ro8zc6x Search systems are never designed to be strongly consistent.
But if you want strong consistency, then your API will have to synchronously write to the DB and to the search engine: a massive overkill, tbh.
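A common middle ground for this (not necessarily what Twitter does) is read-your-own-writes at the edge: merge the author's recent, not-yet-indexed posts from the primary store into their own search results, while everyone else waits for the index to catch up. A toy sketch with made-up names and data:

```python
def search_with_own_writes(query, index, recent_writes, user_id):
    """Merge search-index hits with the querying user's recent
    unindexed posts, so an author sees their own post immediately.
    Illustrative only: substring match stands in for real search."""
    hits = [doc for doc in index if query in doc["text"]]
    own = [
        doc for doc in recent_writes.get(user_id, [])
        if query in doc["text"] and doc not in hits
    ]
    return own + hits  # author's fresh posts surface first

index = [{"user": "bob", "text": "kafka rocks"}]          # already indexed
recent = {"alice": [{"user": "alice", "text": "kafka lag"}]}  # not yet indexed
print(search_with_own_writes("kafka", index, recent, "alice"))
```

This keeps the write path asynchronous while papering over the one case where the lag is actually visible to a user.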
@@AsliEngineering Yeah got it.
Thanks Arpit for making this video!
I had some follow-up, curious questions:
- What happens when there is too much data on Kafka, causing backpressure while indexing?
- Can MapReduce create an Elasticsearch-understandable file which can be used for bulk insertion? In the current architecture the worker will again be making 1:1 calls.
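On the second question: Elasticsearch's `_bulk` API accepts newline-delimited JSON where an action line precedes each document, so a MapReduce job could in principle emit such a file directly and avoid 1:1 calls. A minimal generator for that format (the index name and documents here are made up):

```python
import json

def to_bulk_ndjson(index_name, docs):
    """Render (doc_id, source) pairs as Elasticsearch _bulk NDJSON:
    one action line, then the source document, per doc."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

payload = to_bulk_ndjson("tweets", [("1", {"text": "hello"}),
                                    ("2", {"text": "world"})])
print(payload)
```

Whether pre-building such files actually helps depends on whether the bottleneck is request overhead or the cluster's indexing throughput itself.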
Why don't the API servers write directly to Kafka instead of going through the proxy?
Because it was a system rewrite and they did not want to change any upstream.
Also, to add onto Arpit's point, I would assume the proxy still has authority over the rate of requests, plus some kind of auth. In case of a strange burst, we could avoid pushing a lot of unwanted data to Kafka.
What if Kafka gets too many messages? Will it drop some messages?
Back pressure.
No, the beauty of Kafka is its append-only log: it will append the messages and you just have to consume them. You can then configure topics to delete older data based on configuration (bytes or time or both). Of course there are also compacted topics, but that's another way of reducing the data footprint (and it has its own problems :) )
@@dharins1636 Thank you
Hi sir, I'm a 1st-year student. Should I buy your system design course?
Not at all. It is meant for people with more than 2 years of work experience.
Are the worker nodes Spark jobs which stream from Kafka and write to Elasticsearch at a particular window or interval? @arpit @asliengineering
Could be. The implementation can be anything: raw consumers, or Spark jobs.