Very surprised at 30:03 to hear him say the alert should be as precise as possible. My understanding of Google's SRE practice was to only alert on what users care about - "can they log in, can they add something to their cart?" ua-cam.com/video/CGldVD5wR-g/v-deo.html It seems as though the speaker is muddying good alerts with good monitoring at that point in the talk. Why is this important? Because trying to make alerts super specific, means 100s of alerts and it's too difficult to avoid alert fatigue. Avoiding this fatigue ensures meaningful alerts... I don't need my alert notification to tell me which specific part of the system failed, just that I need to open my monitoring dashboard, and with good monitoring, it will be relatively clear what went wrong.
User focus and precision go along together very well. An alert shouldn't be just "users see an increased rate of HTTP 500s" but "the checkout page sees a high rate of HTTP 500s, specifically when served from Cloud region $foo". This gives you a lot more to work with, saving you critical time when trying to understand what you need to fix the issue the users are experiencing.
DevOps doesn't need to get rid of separate organizations. It's a culture. not an org scheme, not a role. How you implement that culture it's up to you - here we can see Google's approach. You might have a different one.
1.25x FTW
Dude talks on 1.25x like a normal person talks on 1x
Very surprised at 30:03 to hear him say the alert should be as precise as possible. My understanding of Google's SRE practice was to only alert on what users care about - "can they log in, can they add something to their cart?"
ua-cam.com/video/CGldVD5wR-g/v-deo.html
It seems as though the speaker is muddying good alerts with good monitoring at that point in the talk.
Why is this important? Because trying to make alerts super specific, means 100s of alerts and it's too difficult to avoid alert fatigue. Avoiding this fatigue ensures meaningful alerts... I don't need my alert notification to tell me which specific part of the system failed, just that I need to open my monitoring dashboard, and with good monitoring, it will be relatively clear what went wrong.
User focus and precision go along together very well. An alert shouldn't be just "users see an increased rate of HTTP 500s" but "the checkout page sees a high rate of HTTP 500s, specifically when served from Cloud region $foo". This gives you a lot more to work with, saving you critical time when trying to understand what you need to fix the issue the users are experiencing.
"Gmail 500" lmfao
Services Reliability engineer, nice twist!
Doesn‘t sound like DevOps to me, but just traditional silos of Dev and Ops...
SRE rolls SWE into the team for a long-enough period of time to train them on SRE capability. How's that silo?
DevOps doesn't need to get rid of separate organizations. It's a culture. not an org scheme, not a role. How you implement that culture it's up to you - here we can see Google's approach. You might have a different one.
class SRE implements DevOps! ;)