Mean Time to Recovery (MTTR)

Rocky Warren
December 31, 2020
  • Blue/green deploy, roll forward
  • Timeouts/deadlines for all remote calls
  • Retries with capped exponential backoff and jitter, at a single point in the stack (retry only idempotent calls, decide based on response code, and monitor retry rates)
  • Well-exercised fallbacks
  • What happens when downstream services are down? Degrade gracefully, set timeouts (Postgres), and use circuit breakers (for all sync downstream calls via Resilience4j, which also does bulkheads and load shedding)
  • For migrations, etc., use a two-phase deploy with bake time: readers go before writers when rolling forward, and writers go before readers when rolling back
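
The retry bullet above can be sketched roughly as follows. This is a minimal illustration, not a production library. It assumes the "full jitter" variant (a random delay between zero and the capped exponential ceiling), and every name in it (`RETRYABLE_STATUS`, `backoff_delay`, `call_with_retries`, `on_retry`) and every default value is hypothetical. The remote call is modeled as a function returning a `(status, body)` pair, and the caller is assumed to pass only idempotent calls.

```python
import random
import time

# Assumption: response codes treated as transient. Tune this per service.
RETRYABLE_STATUS = {429, 502, 503, 504}


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Return a full-jitter delay in seconds for a zero-based attempt."""
    # The exponential ceiling doubles per attempt but never exceeds the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter spreads clients out so they do not retry in lockstep.
    return rng() * ceiling


def call_with_retries(call, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random, on_retry=None):
    """Invoke call() until it returns a non-retryable status or attempts run out.

    `call` must be idempotent and return a (status, body) tuple.
    `on_retry`, if given, is called before each sleep so retries can be counted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = call()
        last_attempt = attempt == max_attempts - 1
        # Decide based on response code: anything not listed is returned as-is.
        if status not in RETRYABLE_STATUS or last_attempt:
            return status, body
        delay = backoff_delay(attempt, base, cap, rng)
        if on_retry is not None:
            on_retry(attempt, status, delay)
        sleep(delay)
```

Keeping this wrapper at one layer of the stack matters because retries multiply: if three layers each make four attempts, a single user request can turn into up to 64 calls against a struggling dependency. The `on_retry` hook is where a retry counter or metric could be emitted.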