Skip to Main Content

Making sense of SRE and observability, one week at a time.

What is site reliability engineering (SRE) really about? How can I make sense of it in my organisation? How do I cut through the buzzwords and actually improve the lives of my colleagues and customers?

Latest episode
Watch now

What is chaos engineering and how is it being used in 2025? This week I'm joined by Gremlin CEO and founder Kolton Andrus to discuss... 🌪️ What is chaos engineering and what is its origins? 🪴 How has it evolved over the year? 🤖 The role of AI agents in SRE work 💰 Justifying the value of chaos engineering 🏃‍♀️‍➡️ How do I get started? ...and much more. You can find Kolton on: LinkedIn: https://www.linkedin.com/in/kolton-andrus-77315a2/ And you can find out more about Gremlin's new reliability intelligence platform here: https://www.gremlin.com/technologies/reliability-intelligence You can find Stephen on: LinkedIn: https://www.linkedin.com/in/stephentownshend/ Bluesky: https://bsky.app/profile/slightreliability.bsky.social YouTube: https://www.youtube.com/c/SlightReliability Instagram: https://www.instagram.com/slight_reliability/ TikTok: https://www.tiktok.com/@the_kiwi_sre

Latest episodes
Stephen Townshend

About the host

Stephen has a background in SRE and performance engineering. He has worked in the industry for 15 years as both an external consultant and an internal engineer.

Our industry is full of buzzwords and exaggerations, it can be hard to know what is real or not. Stephen strives to take these complex technical concepts and to simplify and present them in a way everyone can understand and apply (and to call out when something is too good to be true).

Stephen lives in Auckland, New Zealand and currently works as a Developer Advocate for SquaredUp, as well as promoting and improving observability and SRE practices internally in the organisation.