Ability to recover from failure

In tech, the ability to switch off the whole system, or parts of it, and still recover is the first step in resiliency.

Making the lower-level components replaceable is one way of improving both reliability and resiliency.
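
A minimal sketch of what "replaceable" can look like in code, assuming a service that depends on a storage component only through a small interface. All names here (`Store`, `PrimaryStore`, `ReplicaStore`, `Service`) are hypothetical and exist only for illustration.

```python
from typing import Protocol


class Store(Protocol):
    """The only contract the service relies on (hypothetical interface)."""

    def get(self, key: str) -> str: ...


class PrimaryStore:
    """Hypothetical lower-level component that can be switched off."""

    def __init__(self) -> None:
        self.up = True
        self.data = {"greeting": "hello"}

    def get(self, key: str) -> str:
        if not self.up:
            raise ConnectionError("primary store is down")
        return self.data[key]


class ReplicaStore:
    """Hypothetical stand-in that satisfies the same interface."""

    def __init__(self) -> None:
        self.data = {"greeting": "hello"}

    def get(self, key: str) -> str:
        return self.data[key]


class Service:
    """Depends on the Store interface, not on a concrete component."""

    def __init__(self, store: Store) -> None:
        self.store = store

    def replace_store(self, store: Store) -> None:
        # Swapping the component does not touch the rest of the service.
        self.store = store

    def greet(self) -> str:
        return self.store.get("greeting")
```

Because `Service` only knows about `get`, switching off `PrimaryStore` and calling `replace_store(ReplicaStore())` is enough to bring `greet()` back. Real systems would also need to keep the replica's data in sync, which this sketch ignores.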

Observability plays a critical role in resiliency, and RCAs (root cause analyses) help us learn from our failures, so that we are more aware of what leads to failures and how we can recover from them.

Terms

  1. recovery
  2. automatic recovery
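
One way to read the difference between the two terms: with plain recovery someone restarts the failed part, while with automatic recovery the system does it itself. A rough sketch of the second case, assuming a supervisor that restarts a failing task a bounded number of times. `supervise` and `FlakyTask` are hypothetical names, not from any library.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def supervise(task: Callable[[], T], max_restarts: int = 3) -> Tuple[T, int]:
    """Run task; on failure, restart it up to max_restarts times.

    Returns the task result and the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            return task(), restarts
        except Exception:
            if restarts >= max_restarts:
                # Give up and surface the last failure to the caller.
                raise
            restarts += 1


class FlakyTask:
    """Hypothetical task that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.remaining_failures = failures

    def __call__(self) -> str:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated failure")
        return "ok"
```

For example, `supervise(FlakyTask(failures=2))` recovers without intervention, while a task that keeps failing past `max_restarts` still raises, which is the point where a human (and an RCA) comes in.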
