Back the func off, this is my abstraction!

Integrating services over APIs exposes your application to a range of possible failures. At scale, any network interaction can and will fail. Implementing a retry mechanism is a common approach to increase fault tolerance. Taking into account how systems fail when designing software can greatly improve the quality of your code. In this blog post I'd like to show how I approach these kinds of problems.

Let's dive into some of the design considerations you might make when designing abstractions that reach over a network. The specific areas we'll be going over are:

Embracing failure in abstraction design
Designing exceptions for abstractions
Increasing fault tolerance with retries
Implementing a back-off strategy

In this post I'll be referencing different layers. For clarity, here are the descriptions of the layers that I use:

Consumer Layer
This is the code that uses the abstraction. This layer is very high-level and has very little implementation-specific code.
Abstraction Layer
The abstraction layer is used by the consumer layer. It defines interfaces and implementation-agnostic code (algorithms, services, value objects, etc.). Code inside the abstraction layer is high-level when compared to the implementation layer and lower-level when compared to the consumer layer.
Implementation Layer
This layer contains code that satisfies interfaces defined by the abstraction layer. As expected, this layer contains the most implementation details. Code in the consumer layer should not need to know the details contained in this layer.

 ----------------------------
 | consumer                 |
 |--------------------------|
 || abstraction            ||
 ||------------------------||
 ||| implementation       |||
 ||------------------------||
 |--------------------------|
 ----------------------------

Embracing failure in abstraction design

Abstractions provide value by hiding implementation details and supplying an API that makes sense to their consumers. Abstractions reduce complexity for their consumers. Abstraction design has many different aspects. The developer experience of an abstraction can make or break it. Exception design is an important part of abstraction design, yet often overlooked. To understand the value of exceptions in abstractions, let's look at an example.

We'll be building part of a car rental service. Cars that are available for rental are part of a fleet. A fleet is a collection of cars. When a car is rented out, the car is pulled from the fleet which prevents double bookings.

interface Fleet
{
	public function availableCars(PriceRange $priceRange): AvailableCars;

	public function markUnavailable(AvaibleCarId $id): void;
}

The fleet is used in a service layer. The service layer orchestrates a business process. In this case it responds to the RentCar command by issuing a rental agreement, after it pulls the car from the fleet.

public function handle(RentCar $command): void
{
    $this->rentalAgreements->issueAgreement(
        $command->carId(),
        $command->customer()
    );

    $this->fleet->markUnavailable($command->carId());
}

Since the Fleet is only an interface, we need to have an implementation for the consuming code to work. Let's create a database-based implementation.

class DatabaseFleet implements Fleet
{
    public function __construct(private DatabaseConnection $db) {}

    public function availableCars(PriceRange $priceRange): AvailableCars
    {
        // Select every car that is not rented out and falls within the price range.
        $rows = $this->db->select('cars', 'c')
            ->where('c.rented', '=', false)
            ->where('c.price', '>=', $priceRange->minAmount())
            ->where('c.price', '<=', $priceRange->maxAmount())
            ->execute();

        return $this->mapToAvailableCars($rows);
    }

    public function markUnavailable(AvaibleCarId $id): void
    {
        // Flag the car as rented so it no longer shows up as available.
        $this->db->update('cars', 'c')
            ->where('c.id', '=', $id->toString())
            ->set('rented', true)
            ->execute();
    }
}

With the interface satisfied, we can now use the system. Mission complete, right? Well... not exactly. Without failures everything is fine. However, there's always the chance that a database interaction fails. A network interaction may fail at any time for any number of reasons. When this happens (and it will) the system needs to respond accordingly.

In our case, when pulling the car from the fleet fails, the car remains available. This can result in a double booking, which is a problem. The database layer used in the implementation layer throws an exception whenever a database error occurs. This exception could be handled in the consuming code by catching it.

public function handle(RentCar $command): void
{
    $this->rentalAgreements->issueAgreement(
    	$command->carId(),
        $command->customer()
    );
    
    try {
    	$this->fleet->markUnavailable($command->carId());
    } catch (DatabaseException $e) {
        // handle exception
    }
}

When a consumer of an abstraction is aware of an implementation detail we have a problem with our abstraction. In this case, the consumer handles an exception defined in the implementation layer. This exception is not part of the abstraction. Problems such as these are called abstraction leaks.

Designing exceptions for abstractions

Abstraction leaks can be fixed by preventing consumers from being exposed to implementation details. For exceptions, this is done by creating abstraction-specific exception types. These exceptions live in the same place where you design the interface of your abstraction. The name of the exception should make sense to the consumer of the abstraction regardless of what implementation is behind it. The exceptions should also be relevant to the consumer of the abstraction.

An exception for this particular abstraction might look something like:

final class UnableToMarkCarUnavailable extends RuntimeException
{
    // Named constructor: static, so it can be called without an existing instance.
    public static function because(string $reason, Throwable $previous): self
    {
        return new self(
            "Unable to mark car unavailable. $reason",
            0,
            $previous
        );
    }
}

Once introduced, the consumers of the abstraction can now take this new exception into account.

public function handle(RentCar $command): void
{
    $this->rentalAgreements->issueAgreement(
    	$command->carId(),
        $command->customer()
    );
    
    try {
    	$this->fleet->markUnavailable($command->carId());
    } catch (UnableToMarkCarUnavailable $e) {
        // handle exception
    }
}

We have eliminated an abstraction leak. With the implementation details removed, consumers can now remain unaffected when the implementation is replaced. Their code will function in the same way, even though what happens under the hood is totally different. An example of this would be when a database-based implementation is swapped out for one that uses an API client. The consumer of the abstraction does not care what happens internally.

By using abstraction-specific exceptions, consumers can deal with failures in an implementation-agnostic way. In short, when implementations of an abstraction can fail, create exceptions that represent the failure at the abstraction level.
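
For the consumer's catch block to ever trigger, the implementation has to translate its own exceptions into the abstraction's exception. Here is a minimal sketch of that translation. It assumes a static because() named constructor, and FailingConnection, markRented() and the string car ID are hypothetical stand-ins so the snippet runs on its own. The Fleet interface is left out for brevity.

```php
<?php

// Hypothetical stand-in for the exception thrown by the database layer.
final class DatabaseException extends RuntimeException {}

final class UnableToMarkCarUnavailable extends RuntimeException
{
    public static function because(string $reason, Throwable $previous): self
    {
        return new self("Unable to mark car unavailable. $reason", 0, $previous);
    }
}

// Hypothetical connection that always fails, to demonstrate the failure path.
final class FailingConnection
{
    public function markRented(string $carId): void
    {
        throw new DatabaseException('connection refused');
    }
}

final class DatabaseFleet
{
    public function __construct(private FailingConnection $db) {}

    public function markUnavailable(string $carId): void
    {
        try {
            $this->db->markRented($carId);
        } catch (DatabaseException $e) {
            // Translate the implementation-level failure into the
            // abstraction-level exception, keeping the original as $previous.
            throw UnableToMarkCarUnavailable::because('The database update failed.', $e);
        }
    }
}
```

Passing the original exception along as $previous keeps the low-level details available for logging without making the consumer depend on them.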

Increasing fault tolerance with retries

Integrating with HTTP-based APIs exposes an application to a wide range of network-related issues. To increase the reliability of our service, retries can be introduced.

When adding a retry mechanism you first need to determine where to place it. Retry mechanisms can be placed inside the abstraction, inside a specific implementation, or in the consuming code. When placed in the consumer code, the consumer has direct control over the retry mechanism, but you may have to implement the same logic over and over. Placing the retry mechanism inside the implementation keeps the abstraction simple, but when there are many different implementations, you have to implement it in each of them. Each of these options has upsides and downsides, so it's good to consider which one is right for you. Generally speaking, pulling complexity down is favourable because it lowers complexity for consumers.

In our case, we may want to increase the fault tolerance of the markUnavailable call. Let's implement a retry mechanism.

public function handle(RentCar $command): void
{
	...
    
    
    $attempts = 0;
    start:
    try {
    	$attempts++;
    	$this->fleet->markUnavailable($command->carId());
    } catch (UnableToMarkCarUnavailable $e) {
		if ($attempts <= 10) {
        	usleep(100000);
       		goto start;
        }
        // handle exception
    }
}

We've modified our implementation to retry the operation whenever we are unable to mark a car unavailable. By catching the abstraction's exception this retry mechanism will work for any underlying implementation.

The retry mechanism uses a 0.1 second delay between calls. Quick retries cause the client and the server (the recipient of the call) to do a lot of work. When a server is already overloaded, multiple clients retrying quickly can push it further into overload. A common practice is to increase the delay after each failed attempt.

public function handle(RentCar $command): void
{
	...
    
    
    $attempts = 0;
    $delay = 100000;
    start:
    try {
    	$attempts++;
    	$this->fleet->markUnavailable($command->carId());
    } catch (UnableToMarkCarUnavailable $e) {
		if ($attempts <= 10) {
        	usleep($delay * $attempts);
       		goto start;
        }
        // handle exception
    }
}

Now we've used the attempt counter to calculate how long we wait between calls.

Implementing a back-off strategy

Over time you may need to change how you increase your back-off time. Instead of backing off linearly you may want to back off exponentially. There are even papers that claim backing off based on a Fibonacci sequence is the way to go.
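
To make the difference between these growth patterns concrete, here is a small sketch that turns an attempt number into a delay in three ways. The function names and the BASE_DELAY constant are my own for illustration. The base delay matches the 0.1 second used earlier.

```php
<?php

// Base delay in microseconds (0.1 second).
const BASE_DELAY = 100000;

// Grows by a fixed step: 0.1s, 0.2s, 0.3s, ...
function linearDelay(int $attempt): int
{
    return BASE_DELAY * $attempt;
}

// Doubles every attempt: 0.1s, 0.2s, 0.4s, 0.8s, ...
function exponentialDelay(int $attempt): int
{
    return BASE_DELAY * 2 ** ($attempt - 1);
}

// Follows the Fibonacci sequence (1, 1, 2, 3, 5, ...): 0.1s, 0.1s, 0.2s, 0.3s, ...
function fibonacciDelay(int $attempt): int
{
    [$previous, $current] = [0, 1];
    for ($i = 1; $i < $attempt; $i++) {
        [$previous, $current] = [$current, $previous + $current];
    }

    return BASE_DELAY * $current;
}
```

By the eighth attempt the linear delay is 0.8 seconds, the Fibonacci delay is 2.1 seconds, and the exponential delay is 12.8 seconds. Fibonacci sits in between: it grows faster than linear, but less aggressively than doubling.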

A back-off strategy can calculate the back-off time using an attempt counter. The strategy turns the counter into a duration and uses it to wait. Adding these algorithms increases the complexity of our code. We can reduce complexity by pulling it down into an abstraction.

We can create an interface to encapsulate the back-off behaviour. The abstraction will be responsible for calculating the waiting time, and for throwing the exception when we've exhausted all of our tries.

interface BackOffStrategy
{
	/**
     * @throws Throwable
     */
	public function backOff(int $attempt, Throwable $exception): void;
}

An exponential implementation of this interface can look like this:

class ExponentialBackOffStrategy implements BackOffStrategy
{
    private const INITIAL_DELAY = 100000; // microseconds
    private const MAX_ATTEMPTS = 25;

    public function backOff(int $attempt, Throwable $exception): void
    {
        // Give up by rethrowing the original exception once the tries are exhausted.
        if ($attempt > self::MAX_ATTEMPTS) {
            throw $exception;
        }

        // Double the delay for every attempt: 0.1s, 0.2s, 0.4s, ...
        $duration = self::INITIAL_DELAY * 2 ** ($attempt - 1);
        usleep($duration);
    }
}

Our consuming code can now be converted to consume the back-off abstraction.

public function handle(RentCar $command): void
{
	...
    
    
    $attempts = 0;
    start:
    try {
    	$attempts++;
    	$this->fleet->markUnavailable($command->carId());
    } catch (UnableToMarkCarUnavailable $e) {
		$this->backOff->backOff($attempts, $e);
        goto start;
    }
}

The code that consumes the back-off abstraction is now unaware of how a back-off is performed. We removed the need to know about how the back-off happens, which has decreased the complexity.

Changing the back-off strategy at a later point can be done by supplying an alternate implementation, and doesn't require any consumer code to change. Creating points in your design that can be swapped out like this increases the stability of our code (how often it needs to change).
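
As an illustration of such a swap, here is a sketch of a linear strategy behind the same interface. The class name and the constructor parameters are assumptions of mine. The interface is repeated only so the snippet runs on its own.

```php
<?php

interface BackOffStrategy
{
    /** @throws Throwable */
    public function backOff(int $attempt, Throwable $exception): void;
}

// Hypothetical alternative: the delay grows by a fixed step instead of doubling.
final class LinearBackOffStrategy implements BackOffStrategy
{
    public function __construct(
        private int $initialDelay = 100000, // microseconds
        private int $maxAttempts = 10,
    ) {}

    public function backOff(int $attempt, Throwable $exception): void
    {
        // Rethrow the original exception once the tries are exhausted.
        if ($attempt > $this->maxAttempts) {
            throw $exception;
        }

        usleep($this->initialDelay * $attempt);
    }
}
```

Because the handler only depends on BackOffStrategy, passing this class in place of the exponential one needs no change in the consuming code.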

If you're looking to use a back-off strategy like this, check out EventSauce's BackOff package. It uses jitter to further improve the back-off strategy.
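
Jitter means adding randomness to the delay, so that clients which failed at the same moment don't all retry at the same moment. Below is a minimal sketch of one common variant, often called full jitter. The function name and its defaults are my own assumptions and are not taken from the EventSauce package.

```php
<?php

// Hypothetical helper: picks a random delay (in microseconds) between zero
// and the exponential delay for this attempt, capped at $maxDelay.
function jitteredDelay(int $attempt, int $baseDelay = 100000, int $maxDelay = 5000000): int
{
    $ceiling = (int) min($maxDelay, $baseDelay * 2 ** ($attempt - 1));

    return random_int(0, $ceiling);
}
```

A strategy could pass the result to usleep() instead of the fixed exponential duration. The cap keeps the wait bounded no matter how many attempts are made.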

To sum it up

We've looked at an abstraction and uncovered an abstraction leak. We added abstraction-level exceptions to remedy the leak. We added a retry mechanism to increase resilience. Lastly, we encapsulated the back-off behaviour. By doing this, we have improved the quality of our abstraction, increased the code stability, and increased the system's fault tolerance.

I hope you've enjoyed a peek into how I design software. For me, software design is shaped by many complementary (and at times contradicting) ideas. Looking forward to hearing what you think of this approach.

Comments can be posted on reddit.

Bonus section!

An alternate place for the retry mechanism was discussed in the Reddit comments. At first I didn't want to overload this blog post with too many design elements, but I'll add it anyway.

The retry mechanism can also be added as a decorator on the Fleet implementation. This is a great way to compose systems and allows you to transparently add behaviour to a system.

class RetryingFleet implements Fleet
{
	public function __construct(
    	private Fleet $fleet,
        private BackOffStrategy $backOff,
    ) {}
    
    public function availableCars(PriceRange $priceRange): AvailableCars
    {
    	return $this->fleet->availableCars($priceRange);
    }
    
    
    public function markUnavailable(AvaibleCarId $id): void
    {
        $attempts = 0;
        start:
        try {
            $attempts++;
            // Delegate to the wrapped Fleet, passing the ID we were given.
            $this->fleet->markUnavailable($id);
        } catch (UnableToMarkCarUnavailable $e) {
            // Waits before the next try, or rethrows when the tries are exhausted.
            $this->backOff->backOff($attempts, $e);
            goto start;
        }
    }
}

Now that this behaviour is extracted, the consumer layer is unaware of the retry mechanism. The service layer that responds to the command looks like our original setup.

public function handle(RentCar $command): void
{
    $this->rentalAgreements->issueAgreement(
    	$command->carId(),
        $command->customer()
    );
    
    try {
    	$this->fleet->markUnavailable($command->carId());
    } catch (UnableToMarkCarUnavailable $e) {
        // handle exception
    }
}
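
The decorator is put in place at the point where the Fleet is constructed. The sketch below shows that wiring with simplified, hypothetical versions of the types from this post: string IDs, a single method, and two test doubles (FlakyFleet and ImmediateBackOff) that I made up so the snippet runs without a database or real waiting. It uses a while loop where the code above uses goto, and the retry behaviour is the same.

```php
<?php

interface Fleet
{
    public function markUnavailable(string $id): void;
}

interface BackOffStrategy
{
    /** @throws Throwable */
    public function backOff(int $attempt, Throwable $exception): void;
}

final class UnableToMarkCarUnavailable extends RuntimeException {}

// Test double: fails a fixed number of times, then succeeds.
final class FlakyFleet implements Fleet
{
    public int $calls = 0;

    public function __construct(private int $failures) {}

    public function markUnavailable(string $id): void
    {
        $this->calls++;
        if ($this->calls <= $this->failures) {
            throw new UnableToMarkCarUnavailable("Attempt {$this->calls} failed.");
        }
    }
}

// Test double: never sleeps, gives up once $maxAttempts calls have failed.
final class ImmediateBackOff implements BackOffStrategy
{
    public function __construct(private int $maxAttempts) {}

    public function backOff(int $attempt, Throwable $exception): void
    {
        if ($attempt >= $this->maxAttempts) {
            throw $exception;
        }
    }
}

final class RetryingFleet implements Fleet
{
    public function __construct(private Fleet $fleet, private BackOffStrategy $backOff) {}

    public function markUnavailable(string $id): void
    {
        $attempts = 0;
        while (true) {
            try {
                $attempts++;
                $this->fleet->markUnavailable($id);
                return;
            } catch (UnableToMarkCarUnavailable $e) {
                $this->backOff->backOff($attempts, $e);
            }
        }
    }
}

// The consumer only ever sees a Fleet; the retry behaviour is added here.
$inner = new FlakyFleet(failures: 2);
$fleet = new RetryingFleet($inner, new ImmediateBackOff(maxAttempts: 5));
$fleet->markUnavailable('car-1'); // succeeds on the third call to $inner
```

In a real application this composition would typically live in a factory or a dependency injection container, so the handler keeps depending on the Fleet interface alone.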

As with the other design decisions we've made, this solution is not a silver bullet either. Although our consumer, abstraction, and implementation layers are now unaware of the retry mechanism, we've made the system a little harder to understand for developers reading our code. The decoration based on the interface has added a layer of indirection. When you add too many layers of indirection, it becomes very difficult for code readers to understand what is going on. There is no right or wrong in this case; it's a balancing act.