On-Call In Action: Site Reliability Engineering Best Practices for Building Resilient Systems

Independently published
SKU: 9798283556314 | ISBN13: 9798283556314
$21.82
Condition: New
Usually Ships in 24hrs
In today's "always-on" world, downtime is not an option. Your users expect seamless service, 24/7, and your business depends on it. But how do you guarantee that reliability when complex systems inevitably encounter turbulence? The answer lies in a world-class on-call capability.

"On-Call In Action" is your practical playbook for building just that. This isn't just another theoretical tome; it's a hands-on guide to navigating the high-stakes reality of modern on-call. We'll equip you with the SRE principles, incident management lifecycles, and effective alerting strategies (using the Versus Incident project as our real-world example) that form the backbone of resilient operations.

This book is your friendly guide to making on-call work better. We'll show you:
  • Why being on-call is so important.
  • What to do when a problem (we call it an "incident") happens.
  • How to set up good alerts so you only get called for big problems. We'll even show you how with a free tool called "Versus Incident."
  • How to check if your services are running well (using simple goals).
  • How to learn from mistakes without blaming anyone, so things get better.
  • How to make good on-call schedules so people don't get too tired.
  • How to create a supportive team for on-call work.

Stop just reacting to problems and start engineering reliability. Whether you're a tech person who is on-call, a manager, or just curious, this book will give you clear advice and real examples. We want to help you build an on-call system that keeps your services running and your team feeling good.
This book contains 11 chapters:
  • Chapter 1: Foundations: Why On-Call Matters & SRE Principles
  • Chapter 2: Anatomy of an Incident: The Management Lifecycle
  • Chapter 3: Effective Alerting: Strategy and Routing Use Versus Incident
  • Chapter 4: Integrating Monitoring Sources and Escalation Policies: A Case Study
  • Chapter 5: Measuring Reliability: SLIs, SLOs, and Error Budgets
  • Chapter 6: Putting It All Together: Practical Examples of Unified Alerting & Templating
  • Chapter 7: Learning from Failure: Blameless Postmortems
  • Chapter 8: Sustainable On-Call: Scheduling and Managing Burnout
  • Chapter 9: Effective Incident
  • Chapter 10: The On-Call Ecosystem: Tooling and Future Trends
  • Chapter 11: On-Call in Action: Digital Customer Onboarding in Banking


  • Author: Quan Huynh
  • Publisher: Independently Published
  • Publication Date: May 14, 2025
  • Number of Pages: 182 pages
  • Binding: Paperback or Softback
  • ISBN-10: NA
  • ISBN-13: 9798283556314