Design Patterns for High Availability: What gets you 99.999% uptime?
- Published 5 Jul 2024
- In this video, we discuss the topic of availability in distributed systems.
We categorize organizations based on their acceptable levels of availability, ranging from startups to mature companies aiming for five to six nines of availability.
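As a rough illustration of what each "nine" allows, the sketch below converts an availability target into a yearly downtime budget. The function and constant names are hypothetical, and a 365-day year is assumed.

```python
# Sketch: yearly downtime budget for N nines of availability.
# Assumes a 365-day year; names here are illustrative, not from the video.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(nines: int) -> float:
    """Return the yearly downtime budget, e.g. nines=3 means 99.9%."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** (-nines)  # 3 nines -> 0.001
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 7):
        availability_pct = 100 * (1 - 10 ** (-n))
        minutes = allowed_downtime_seconds(n) / 60
        print(f"{n} nines ({availability_pct:.4f}%): {minutes:.2f} minutes/year")
```

Under these assumptions, five nines works out to roughly 5.26 minutes of downtime per year, and each extra nine cuts the budget by a factor of ten.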
InterviewReady: interviewready.io/
We share a real-world example of a startup facing availability challenges with its database hosted in the wrong region. The solution involves migrating the database to a more suitable location and implementing a step-by-step process to minimize downtime.
Here are five principles for building highly available systems:
1. Simplicity over Perfection
2. Downtime Over Loss
3. Fewer Moving Parts
4. Chaos Engineering
5. Incident Reports and Root Cause Analysis
We also touch upon fault tolerance strategies such as redundancy, load balancing, and database replication to ensure high availability in distributed system components.
Engineers should either leverage existing highly available systems or adopt a principled approach to building and maintaining availability in their systems.
00:00 Who is this video for?
00:20 The 9s of availability
02:43 War Story at InterviewReady
06:14 Principles for High Availability
09:05 Design Patterns for Availability
12:31 Conclusion
12:49 Thank you!
Designing Data-Intensive Applications Book: amzn.to/3SyNAOy
You can follow me on:
Github: github.com/InterviewReady/sys...
Instagram: / interviewready_
LinkedIn: / interview-ready
Twitter: / gkcs_
#HighAvailability #SystemDesign #SoftwareEngineering
Love your content. Your wisdom and knowledge is immense.
Great video, packed with information!
Damn, awesome video. I realized this was the architecture in my startup workplace.
Amazing! In my opinion, this is the most resourceful video with a lot of content easily comprehended within just 13 minutes. Would love to see more of these... :)
Thank you!
Usually it's not that there is a complete outage for x minutes or seconds, depending on the availability target, but rather that some number of service queries fail throughout the time span. This requires availability to be measured as the number of failed queries versus successful queries.
That's a great point, thank you!
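The request-based measurement described in the comment above could be sketched as follows. This is only an illustration: the function names and the example counts are assumptions, not something shown in the video.

```python
# Sketch: availability measured per request rather than per unit of time.
# Function names and sample numbers are hypothetical.


def request_availability(successful: int, failed: int) -> float:
    """Fraction of requests that succeeded in the observed window."""
    total = successful + failed
    if total == 0:
        raise ValueError("no requests observed in this window")
    return successful / total


def meets_target(successful: int, failed: int, target: float) -> bool:
    """True if the observed success ratio reaches the availability target."""
    return request_availability(successful, failed) >= target


if __name__ == "__main__":
    # 10 failures out of 1,000,000 requests is exactly five nines.
    print(request_availability(999_990, 10))
    print(meets_target(999_990, 10, target=0.99999))
    print(meets_target(999_000, 1_000, target=0.99999))
```

Measured this way, a service that never goes fully down can still miss its target if a small share of queries fails steadily over the window.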
Government websites' downtime is approx. 10 days. Even though the user experience is frustratingly bad, we are left with no choice. 😢 Their usage graph is exponential after the downtime, even if the system upgrade decides to mess you up even more.
Great video, been watching your channel for years. Small suggestion: improve audio using a mic.
Great thanks
How can I calculate the number of server replicas I need if I have 1M users, with the peak hour (10-11 pm) serving 80% of the traffic?
What are the moving parts? Some example plssss
I wouldn't say that it's alright if the cache is down. Unless it restarts really quickly, you might get a thundering herd of requests to the main DB, which might make it run out of resources and cause a big problem.
Other than that, great video 😊
Good point!
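One common way to soften the thundering herd described above is to coalesce concurrent requests for the same key, so only one of them reaches the database while the rest wait for its result. The video does not cover this technique; the sketch below is a minimal illustration, and the `SingleFlight` class and its `load` method are hypothetical names.

```python
# Sketch: request coalescing ("single flight") to limit load on the main DB
# when the cache is unavailable. Class and method names are hypothetical.
import threading
from concurrent.futures import Future


class SingleFlight:
    """Coalesces concurrent loads of the same key into one loader call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> Future shared by all waiting callers

    def load(self, key, loader):
        with self._lock:
            future = self._in_flight.get(key)
            is_leader = future is None
            if is_leader:
                # First caller for this key: register a shared Future.
                future = Future()
                self._in_flight[key] = future

        if is_leader:
            try:
                # Only the leader calls the (expensive) loader.
                future.set_result(loader(key))
            except Exception as exc:
                # Followers see the same failure as the leader.
                future.set_exception(exc)
            finally:
                with self._lock:
                    del self._in_flight[key]

        # Leader and followers all return the shared result.
        return future.result()
```

A caller would wrap its database read, for example `flights.load("user:42", read_from_db)`, where `read_from_db` is whatever function queries the main DB. This only collapses duplicate reads that overlap in time; it does not replace the cache or protect against many distinct keys being requested at once.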
Scaling the database is the hardest problem you will face.
You missed a level, Government websites- Availability ~ 70%
exam result websites, Availability ~ 7%
First