# Fault Tolerance Patterns

Fault Tolerance aka Resilience Patterns

⚡️ Tags: 📍Pattern 📍ResiliencePattern 📍DistributedSystem

Fault tolerance means the ability of an architecture to survive (tolerate) misbehavior in its environment by taking corrective action, e.g., surviving a server crash or preventing a misbehaving API from bringing down the whole system.

Fault resilience, by contrast, is the capacity to recover from these kinds of failures quickly.

After further reading of Netflix blogs and wikis, it seems the terms fault resilience and fault tolerance are used interchangeably.

# Cheatsheet


# 2 metrics for a resilient system


Error budgeting

Allowable Downtime = Total Time × (1 − Uptime %)

Mean Time to Recovery (MTTR)

MTTR = Total Downtime / Number of incidents
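
A quick worked example of both formulas (the 99.9% uptime target, the 90 minutes of downtime, and the incident count are made up for illustration):

```python
# Error budget: how much downtime a given uptime target allows.
MINUTES_PER_30_DAYS = 30 * 24 * 60

uptime_target = 0.999  # hypothetical 99.9% availability SLO
allowable_downtime = MINUTES_PER_30_DAYS * (1 - uptime_target)
print(f"Error budget over 30 days: {allowable_downtime:.1f} minutes")  # ~43.2 minutes

# MTTR: average time to recover per incident.
total_downtime = 90      # hypothetical: 90 minutes of downtime this month
number_of_incidents = 3
mttr = total_downtime / number_of_incidents
print(f"MTTR: {mttr:.0f} minutes per incident")  # 30 minutes
```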

# Pattern List

There are various resilience patterns that can be applied to microservices, categorized by their impact on downstream and upstream components. Let's explore some of these patterns in detail.


| Pattern | Impact on |
| --- | --- |
| Retry (exponential backoff, retry amplification) | Downstream / Upstream |
| Bulkhead | Downstream |
| Circuit Breaker | Downstream |
| Timeout | Downstream |
| Rate limiting | Upstream |
| Load leveling | Upstream |
| Load shedding | Upstream |
| Fallback | Upstream |
| Failover | Upstream |
| Stateless | |

These patterns are usually not used alone. To apply them effectively, we need to understand why we need them, how they work, and their limitations.
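
To make the first row of the table concrete, here is a minimal retry-with-exponential-backoff sketch; the function name, attempt counts, and delay parameters are illustrative rather than taken from any particular library, and the jitter is there to reduce retry amplification when many clients fail at once:

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Retry a failing call with jittered exponential backoff (illustrative sketch)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; let the caller's fallback (see below) take over
            # Exponential backoff with full jitter to avoid synchronized retry storms.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```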

# Shopify case study


# Prevent misuse & overload


# Anti-corruption layer


# Retry patterns


# Fallback

When a service request still fails after retries, we need an alternative response. Fallback provides an alternative result when a service request fails. When the circuit breaker trips and the circuit is open, fallback logic can be invoked instead. The fallback logic typically does little or no processing and returns a simple value. It must have little chance of failing, because it runs as the result of a failure to begin with.

In this approach, the client process performs fallback logic when a request fails, such as returning cached data or a default value. This approach is well suited to queries; it is more complex for updates or commands.
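
A minimal sketch of the query-with-fallback idea, assuming a hypothetical `client.get_recommendations` call and a canned default list:

```python
import logging

# Hypothetical default value served when the primary call fails.
DEFAULT_RECOMMENDATIONS = ["bestsellers", "new-arrivals"]


def recommendations_with_fallback(client, user_id):
    """Query with a fallback: serve cached/default data instead of failing."""
    try:
        return client.get_recommendations(user_id)  # primary (remote) call
    except Exception:
        # Fallback logic stays trivial so it has little chance of failing itself.
        logging.warning("recommendation service unavailable, serving default list")
        return DEFAULT_RECOMMENDATIONS
```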

# Deadlines / distributed timeouts

// TODO

# Asynchronous communication

Use asynchronous communication (for example, message-based communication) across internal microservices.
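
A minimal sketch of the idea, using an in-memory queue as a stand-in for a real broker such as Kafka or RabbitMQ; the producer returns immediately, and the consumer drains events at its own pace:

```python
import queue
import threading

# Stand-in for a real message broker (Kafka, RabbitMQ, SQS, ...).
order_events = queue.Queue()


def place_order(order_id):
    """Producer: publish an event and return at once instead of calling billing synchronously."""
    order_events.put({"type": "order_placed", "order_id": order_id})
    return {"status": "accepted", "order_id": order_id}


def billing_worker():
    """Consumer: processes events independently of the producer's availability."""
    while True:
        event = order_events.get()
        print(f"billing order {event['order_id']}")
        order_events.task_done()


threading.Thread(target=billing_worker, daemon=True).start()
place_order(42)
order_events.join()  # wait for the demo event to be processed
```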

# Stateless Services

To keep our services scalable, we need to build them in a stateless manner. By stateless, we mean that the service does not retain any state from previous calls and treats every request as a fresh one. The advantage of this approach is that we can easily create replicas of the same service, and it does not matter which instance handles a given request.
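
A minimal sketch of the difference; the `shared_store` dict stands in for an external store such as Redis or a database:

```python
# Stateful: state lives in the instance, so replicas disagree and restarts lose it.
class StatefulCounter:
    def __init__(self):
        self.count = 0

    def handle_request(self):
        self.count += 1
        return self.count


# Stateless: every instance reads and writes the same external store,
# so any replica can serve any request interchangeably.
class StatelessCounter:
    def __init__(self, store):
        self.store = store  # stand-in for Redis, a database, etc.

    def handle_request(self):
        self.store["count"] = self.store.get("count", 0) + 1
        return self.store["count"]


shared_store = {}
replica_a = StatelessCounter(shared_store)
replica_b = StatelessCounter(shared_store)
assert replica_a.handle_request() == 1
assert replica_b.handle_request() == 2  # a different replica sees the same state
```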


# Correlation ID

In a typical microservices-based application, several services span different systems, possibly separated by large geographical distances. This means each service must log useful, meaningful data that specifies what it has been doing and details any failures. This calls for a resiliency pattern geared towards service tracking.

A correlation ID pattern creates an identifier for each individual request. This can help you track the complete flow of each HTTP request through all communication channels. You can set the correlation ID as part of the HTTP request header and include it in every log message, which will help you to quickly find errors, warnings and relevant debug messages.

While a correlation ID will illustrate the flow of a request from the source to the destination, a log aggregator pulls together the logs from all your microservices for easier search and analysis.
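
A minimal sketch of the idea; the `X-Correlation-ID` header name is a common convention, and the request/logging plumbing is deliberately simplified:

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO,
                    format="%(levelname)s [cid=%(correlation_id)s] %(message)s")


def handle_request(headers):
    """Reuse the caller's correlation ID, or mint one at the edge of the system."""
    correlation_id = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    log = logging.LoggerAdapter(logging.getLogger("orders"),
                                {"correlation_id": correlation_id})

    log.info("processing request")
    # Propagate the ID on every outgoing call so downstream logs share it.
    downstream_headers = {"X-Correlation-ID": correlation_id}
    log.info("calling payment service with headers %s", downstream_headers)
    return correlation_id


handle_request({"X-Correlation-ID": "abc-123"})
```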

Detail: 3 microservices resiliency patterns for better reliability

# Refs