- timeout
- look at the p95 or p90 latency observed by the client and use it as the timeout.
- granularity
- start with an org-level default
- then override at service level
- then override at endpoint level
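
A minimal sketch of the two timeout ideas above: derive a timeout from a latency percentile, then resolve it at the most specific level available. The nearest-rank percentile method, the `TIMEOUTS_MS` layout, and the service and endpoint names are assumptions made for illustration, not taken from any particular system.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Rank of the smallest sample with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical configuration: org default, then service and endpoint overrides.
TIMEOUTS_MS = {
    "org_default": 1000,
    "services": {"payments": 800},
    "endpoints": {("payments", "/charge"): 1500},
}


def resolve_timeout_ms(service, endpoint, config=TIMEOUTS_MS):
    """Return the most specific timeout: endpoint, then service, then org."""
    endpoints = config.get("endpoints", {})
    if (service, endpoint) in endpoints:
        return endpoints[(service, endpoint)]
    services = config.get("services", {})
    if service in services:
        return services[service]
    return config["org_default"]
```

For example, `percentile(latencies_ms, 95)` over recent client-side measurements could feed the value stored for a given endpoint, while services with no measurements yet fall back to the org default.
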
- retries
- good in case of anything that is short term and transient
- network errors
- temporary issues in the service
- retry parameters
- max number of retries
- interval between retries
- immediate
- fixed interval
- linear incremental
- exponential incremental
- exponential incremental with jitter
- exit from retries
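
The retry parameters above can be sketched roughly as follows. `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names chosen for this example, the "full jitter" variant (a uniform draw between 0 and the capped delay) is one common choice among several, and the default values are placeholders.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for short-lived failures that are worth retrying."""


def backoff_delay(strategy, attempt, base=0.1, cap=5.0):
    """Seconds to wait before retry number `attempt` (1-based)."""
    if strategy == "immediate":
        return 0.0
    if strategy == "fixed":
        delay = base
    elif strategy == "linear":
        delay = base * attempt
    elif strategy in ("exponential", "exponential_jitter"):
        delay = base * 2 ** (attempt - 1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    delay = min(delay, cap)
    if strategy == "exponential_jitter":
        # Full jitter: spread clients out so they do not retry in lockstep.
        delay = random.uniform(0, delay)
    return delay


def call_with_retries(operation, max_retries=3, strategy="exponential_jitter",
                      base=0.1, cap=5.0, sleep=time.sleep):
    """Call `operation`, retrying only on TransientError.

    Exits from retries in two ways: any other exception propagates
    immediately, and the last TransientError is re-raised once
    `max_retries` retries have been used.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise
            sleep(backoff_delay(strategy, attempt + 1, base, cap))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In practice the set of retryable errors would map to whatever the client library raises for network errors and temporary server-side failures.
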
- fallback
- identify critical flows
- set thresholds up to which we can still go ahead without the actual response from the server
- think in terms of percentiles and the scale of loss or damage
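
One way to read the fallback notes above, as a sketch only: for a critical flow, proceed without the server's answer when the potential loss is under a threshold, and block otherwise. The order-verification scenario, the `Decision` type, and the `max_unverified_amount` parameter are all hypothetical. The threshold might be picked from a percentile of historical amounts so that most requests can proceed while the worst-case loss stays bounded.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    approved: bool
    source: str  # "server" when the real response was used, "fallback" otherwise


def check_with_fallback(amount: float,
                        remote_check: Callable[[float], bool],
                        max_unverified_amount: float) -> Decision:
    """Ask the server first; on failure, decide locally using the threshold."""
    try:
        return Decision(remote_check(amount), "server")
    except (TimeoutError, ConnectionError):
        # No server response: go ahead only if the possible loss is small.
        return Decision(amount <= max_unverified_amount, "fallback")
```

With `max_unverified_amount=50.0`, a request for 20.0 would still go ahead during an outage, while a request for 500.0 would be blocked until the server responds again.
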