“Timeouts, Retries, and the Devil” dives into the often-overlooked complexity of error handling in high-scale systems. I’ll explore why short timeouts are essential for resilience yet dangerously easy to misuse, and how naive retries can spiral into system-wide meltdowns. Using concrete examples, including a postmortem-style analysis of a production incident inspired by real-world problems, I’ll walk through actionable strategies for designing smarter retries, preventing service hammering, and avoiding cascading failures. Expect practical wisdom, gotchas, and system design insights that will stick with you the next time you’re writing retry().
Room: Room 2
Mon, Oct 27th, 11:10 - 11:40