Measuring Reliability
System success requires that the system be available and work as expected. All teams want a reliable system, but hope is not a strategy. Metrics provide a framework that helps a team understand whether their service is indeed reliable. A Service Level Indicator (SLI) expresses the user’s experience as a quantifiable metric such as latency, throughput, or error rate. A Service Level Objective (SLO) sets the target for an SLI, for example: the error rate over the past five minutes should stay below 0.1% (equivalently, a success rate of at least 99.9%). A good SLO is one that, when barely met, would keep a typical user happy. When that SLO is not met, your users will eventually stop using your service or go to a competitor. Trends in SLIs can help spot a system failure before users experience outages, trigger auto scaling of cloud infrastructure, or indicate that downstream dependencies are silently failing. SLO violations can alert the team that something is wrong with the system. After covering the key terms and concepts, the demo will turn interactive, and we will work together to create several SLIs and SLOs for an imaginary system.
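To make the relationship between an SLI and an SLO concrete, here is a minimal sketch in Python. It assumes a hypothetical `Request` record and a success-rate target of 99.9% over one measurement window; the names, the sample data, and the choice to treat an empty window as healthy are illustrative assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request seen in the measurement window."""
    ok: bool            # True if the request succeeded
    latency_ms: float   # observed latency in milliseconds


def success_rate(requests):
    """SLI: fraction of successful requests in the window.

    An empty window is treated as fully successful (an assumption;
    some teams prefer to report "no data" instead).
    """
    if not requests:
        return 1.0
    return sum(1 for r in requests if r.ok) / len(requests)


def meets_slo(sli, target=0.999):
    """SLO check: the SLI must be at or above the target."""
    return sli >= target


# Illustrative five-minute window: 998 successes and 2 failures.
window = [Request(ok=True, latency_ms=120.0)] * 998 + \
         [Request(ok=False, latency_ms=900.0)] * 2

sli = success_rate(window)
print(f"success rate: {sli:.4f}, SLO met: {meets_slo(sli)}")
```

With this sample window the success rate is 99.8%, so the 99.9% objective is missed even though only two requests failed. That is the kind of signal that could feed an alert before users notice a wider outage.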
Takeaways
- Understand the differences between SLIs, SLOs, and SLAs
- Know the pros and cons of different methods to capture data for your metrics
- Know how to keep an eye on performance and availability so that you can take action before your users are impacted