Slight Reliability | Learning SRE, one day at a time

Making sense of SRE and observability, one week at a time.

What is site reliability engineering (SRE) really about? How can I make sense of it in my organisation? How do I cut through the buzzwords and actually improve the lives of my colleagues and customers?

Latest episode

How do you design, implement and evolve effective alerting for your services and systems? This week I'm joined by Krishna Vinnakota, Senior SRE @ Microsoft, to dive into this topic.

We cover...

👩‍💻 Treating your alerts like production code
✅ The link between SLOs and alerting
🌡️ How do you decide what thresholds to set?
🔊 Signal to noise ratio and how to manage it
🐒 Chaos engineering and postmortems are levers to improve alerting

...and much more.

You can find Krishna on...
LinkedIn: https://www.linkedin.com/in/krishna-vinnakota-8a03408/

You can buy Slight Reliability merch here (Note: you cannot order the mugs outside of New Zealand): https://slightreliability.digitees.co.nz/

You can find Stephen on:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Bluesky: https://bsky.app/profile/slightreliability.bsky.social
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

About the host

Stephen has a background in SRE and performance engineering. He has worked in the industry for 15 years as both an external consultant and an internal engineer.

Our industry is full of buzzwords and exaggerations, and it can be hard to know what is real. Stephen strives to simplify complex technical concepts and present them in a way everyone can understand and apply (and to call out when something is too good to be true).

Stephen lives in Auckland, New Zealand, and currently works as a Developer Advocate for SquaredUp, where he also promotes and improves observability and SRE practices internally.

Latest episodes

Effective Alerting with Krishna Vinnakota (Episode 124)

The Trouble with Certificates with Charlie Al-Batty (Episode 123)

Being a Digital Nomad with Amin Astaneh (Episode 122)

Four Golden Signals to Kickstart SRE (Episode 121)

Staying Motivated as a Leader with Cads Oakley (Episode 120)

A Beginner's Guide to SRE (Episode 119)

Freeing Observability Data Hostages with Jacob Leverich (Episode 118)

How to Change the World with Rob Roe (Episode 117)

Human Software with Richard Bown (Episode 116)

Leadership Gym with Xiao Zhang (Episode 115)

Starting a New Role (Episode 114)

AI Use-cases for SRE with Shmuel Kliger (Episode 113)

Operational Intelligence with Adam Kinniburgh (Episode 112)

Leading Platform Teams with Dinesh Sukhija (Episode 111)

Leadership Round One! (Episode 110)

The Implications of AI on Observability with Aaron "Checo" Pacheco (Episode 109)

Chaos Engineering with Kolton Andrus (Episode 108)

Team Topologies with Luke McManus (Episode 107)

Contributing to Open Source with Wendy Ha (Episode 106)

Influencing Leadership with Nora Jones (Episode 105)

Podcast Retrospective (Episode 104)

Burnout with Colette Alexander (Episode 103)

Mobile Observability with Hanson Ho (Episode 102)

Intro to Resilience Engineering with Michelle Casey (Episode 101)

Learning with John Allspaw (Episode 100)

Focusing on What Matters with Trent Hornibrook (Episode 99)

The Root Cause Fallacy with Andrew Hatch (Episode 98)

Episode 97 - Synthetic Monitoring with David Dick

Episode 96 - Tech Leadership with Milan Brown

Episode 95 - Finding Tech Work with Leon Adato

Episode 94 - Getting a Start in SRE with Priyam Kumar

Episode 93 - SRE Leadership with Michelle Casey

Episode 92 - Observability Maturity with Ádám Tóth

Episode 91 - Head in the Clouds

Episode 90 - Non-Prod Reliability Engineering + 2024 Wrap

Episode 89 - Blameless Post-mortems with Karanveer Anand

Episode 88 - OpenTelemetry Revisited with Zach Michel

Episode 87 - Measuring the value of SRE with Artem Yakimenko

Episode 86 - Evolving SLOs with Dom Finn

Episode 85 - Feeling SaaSsy

Episode 84 - Clinical Troubleshooting with Dan Slimmon

Episode 83 - An Unfulfilled Promise with Itiel Shwartz

Episode 82 - CI/CD with Amin Astaneh

Episode 81 - Incident Management in Non-Prod Environments

Episode 80 - What's Been Bugging Niall Murphy

Episode 79 - Incident Story Time with Valeska Victoria

Episode 78 - Developer Experience with Ankit Jain

Episode 77 - SRE to DevRel with Liz Fong-Jones

Episode 76 - Sampling Distributed Traces with Paige Cruz

Episode 75 - Enterprise SRE with Steve McGhee

Episode 74 - The Hidden Side of Vendor Lock-In

Episode 73 - Enterprise SLOs with Brian Singer

Episode 72 - Rapid Incident Response with Valeska Victoria

Episode 71 - Implementing SRE with Dr. Vlad Ukis

Episode 70 - Meta SRE with Amin Astaneh

Episode 69 - Developer to SRE with Praveen Kasam

Episode 68 - Dashboards and Modern Observability with Eric Schabell

Episode 67 - Single Pane of Glass with Jamie Allen and Adam Kinniburgh

Episode 66 - Building Digital Assistants for SRE with Kyle Forster

Episode 65 - The Truth About Incidents with Courtney Nash

Episode 64 - Observability During Development with Martin Thwaites

Episode 63 - The Power of Summary

Episode 62 - On-Call with Matt Brown

Episode 61 - SRE VS DevOps VS Platform Eng... (Yawn)

Episode 60 - From Zero to SRE with Amin Astaneh

Episode 59 - Bad API Observability with Sonja Chevre

Episode 58 - Tackling Cloud Cost with Harinder Seera

Episode 57 - A Tale of Three Conferences

June 9th 2023 Update

Episode 56 - Dashbored

Episode 55 - Reflections on KubeCon with Bruce Cullen

Episode 54 - Trends in Incident Management with Andy Thurai

Episode 53 - DORA Metrics with Tim Wheeler

Episode 52 - Double, Double, Toil and Trouble!