mitigation

  1. timeout
    1. look at the p95 or p90 latency observed by the client and use it as the timeout.
    2. granularity
      1. start with org level
      2. service level
      3. endpoint level
  2. retries
    1. good for failures that are short-lived and transient
      1. network errors
      2. temporary issues in the service
    2. retry parameters
      1. max no. of retries
      2. interval between retries
        1. immediate
        2. fixed interval
        3. linear incremental
        4. exponential incremental
        5. exponential incremental with jitter
    3. exit from retries
      1. exponential backoff
      2. kill switch
  3. fallback
    1. identify critical flows
    2. set thresholds at which we can still go ahead without the actual response from the server
    3. think in terms of percentiles and the scale of loss or damage
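
The three mitigations above can be sketched together in Python. This is a minimal illustration under stated assumptions, not a definitive implementation: every name here (`timeout_from_latencies`, `retry_delay`, `call_with_retries`, `call_with_fallback`, `TransientError`) is hypothetical, the `base` and `cap` defaults are arbitrary, and the jitter variant shown (uniform between 0 and the exponential delay) is only one of several common choices.

```python
import random
import statistics
import time


class TransientError(Exception):
    """Hypothetical marker for short-lived failures worth retrying."""


def timeout_from_latencies(latencies_ms, percentile=95):
    """Pick a timeout as the p-th percentile of observed client latencies."""
    # n=100 yields 99 cut points; index p-1 is the p-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[percentile - 1]


def retry_delay(attempt, strategy, base=0.1, cap=5.0, rng=random):
    """Seconds to wait before retry number `attempt` (1-based)."""
    exponential = min(cap, base * 2 ** (attempt - 1))
    if strategy == "immediate":
        return 0.0
    if strategy == "fixed":
        return base
    if strategy == "linear":
        return min(cap, base * attempt)
    if strategy == "exponential":
        return exponential
    if strategy == "exponential_jitter":
        # Assumed variant: spread callers uniformly over [0, exponential].
        return rng.uniform(0.0, exponential)
    raise ValueError(f"unknown strategy: {strategy}")


def call_with_retries(fn, max_retries=3, strategy="exponential_jitter",
                      kill_switch=lambda: False, sleep=time.sleep):
    """Call fn, retrying on TransientError until retries run out
    or the kill switch reports True."""
    attempt = 0
    while True:
        try:
            return fn()
        except TransientError:
            attempt += 1
            if attempt > max_retries or kill_switch():
                raise
            sleep(retry_delay(attempt, strategy))


def call_with_fallback(fn, fallback_value, **retry_kwargs):
    """For non-critical flows: go ahead with a default if retries fail."""
    try:
        return call_with_retries(fn, **retry_kwargs)
    except TransientError:
        return fallback_value
```

Passing `sleep` and `kill_switch` as parameters keeps the sketch testable and lets an operator stop retries without a redeploy. In practice the latencies fed to `timeout_from_latencies` would come from client-side metrics at whichever granularity (org, service, or endpoint) is being tuned.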
