Designing idempotent serverless systems

Feb 27th, 2024

System Design Asynchronous architecture AWS Serverless

Idempotency is often overlooked in API design, yet it is crucial for reliability. Design and build an asynchronous API that ensures robust performance and resilient operations.

Embracing Failures

Consider a hotel booking system. Booking a stay is a complex operation that, under the hood, requires many steps to succeed:

The customer initiates the process by placing a request.
The hotel validates and confirms the booking request.
A room is allocated and reserved for the customer.
Payment processing follows to secure the reservation.
A confirmation is dispatched to the customer with the booking details.

Such a complex operation is often decomposed into a controlling process that invokes several smaller services, each handling a specific aspect of the workflow. Whenever one service or system invokes another, failure is a possibility. Failures have many causes: network issues, database unavailability, load balancer problems, system errors, and more.

As this example shows, building a software system is complex: the desired outcome depends on a series of smaller steps, and each of these steps can fail for multiple reasons. Given the complexity of even a single operation, let alone a chain of operations, it's essential to develop systems that acknowledge the possibility of errors and are resilient to them.

What to do in case of errors

Okay, failures happen, we get it, so what can we do about it? It's rare that systems fail as a whole. More often they suffer a partial failure, in which only a subset of the operations succeeds. The second common failure mode is a transient failure, in which a request fails only for a short period of time. The easiest approach in such a case is simply to retry (surprisingly, in most cases, it works really well). Retries are built into many AWS services by default. For example, when an asynchronously invoked Lambda function fails, Lambda retries it twice by default, for three attempts in total. For the retry mechanism to function effectively, every operation in the system must be safe to retry, meaning that repeating it causes no additional side effects.
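To make the retry idea concrete, here is a minimal sketch of a retry helper with exponential backoff. The helper name `withRetry`, its parameters, and the default values are assumptions chosen for illustration; they are not taken from any AWS service or SDK.

```typescript
// Hypothetical helper: retries an async operation with exponential backoff.
// maxAttempts and baseDelayMs are illustrative defaults, not AWS values.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Double the delay after every failed attempt.
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // All attempts failed: surface the last error to the caller.
  throw lastError;
}

// A flaky operation that fails twice before succeeding.
let calls = 0;
const flaky = async (): Promise<string> => {
  calls += 1;
  if (calls < 3) throw new Error("transient failure");
  return "ok";
};
```

Calling `withRetry(flaky)` succeeds on the third attempt. Note that the helper is only safe to use when `operation` itself can be repeated without additional side effects, which is exactly the property the rest of this article is about.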

Distributed Systems

Coming back to the hotel booking example, such a system would likely be divided into microservices, effectively making it a distributed system. In distributed systems, the parts must communicate, which is usually accomplished with some form of messaging. There are many implementations available, each with its own characteristics. Some use cases are better suited to events (such as Amazon EventBridge), while others benefit more from queues (like Amazon SQS) or from implementations outside of AWS, such as Kafka or RabbitMQ. Each of these options offers something unique and has its own drawbacks. However, they all share one limitation: in the general case, exactly-once delivery cannot be guaranteed.

There's even this popular joke:

Since there is no guarantee that a message will be delivered only once, we must design systems that expect the same message to be processed multiple times. Therefore, as with the retry strategy, each operation must be safe to perform multiple times without causing additional side effects.

What is idempotence

Idempotence is a property in computer science and mathematics where an operation can be applied multiple times without changing the result beyond the initial application. In other words, performing the operation once has the same effect as performing it multiple times (no side effects).

Here are some examples of idempotence:
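As a small illustration in code, the sketch below contrasts an idempotent operation with a non-idempotent one. The `Booking` shape and both functions are hypothetical and exist only for this example.

```typescript
interface Booking {
  status: string;
  nights: number;
}

// Idempotent: setting a value yields the same state however often it is applied.
const confirmBooking = (b: Booking): Booking => ({ ...b, status: "CONFIRMED" });

// Not idempotent: each application changes the state again.
const addNight = (b: Booking): Booking => ({ ...b, nights: b.nights + 1 });

const booking: Booking = { status: "REQUESTED", nights: 2 };

const confirmedOnce = confirmBooking(booking);
const confirmedTwice = confirmBooking(confirmBooking(booking));
const extendedOnce = addNight(booking);
const extendedTwice = addNight(addNight(booking));

console.log(confirmedOnce.status === confirmedTwice.status); // true
console.log(extendedOnce.nights === extendedTwice.nights); // false
```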

Idempotency with HTTP Request Methods

Some HTTP methods are idempotent by definition. A safe method is one that does not modify the state of the server, meaning that it is a read-only operation. A safe method is always idempotent. The following table shows which methods are safe and/or idempotent.

Name	Is safe?	Is idempotent?
`GET`	✅	✅
`HEAD`	✅	✅
`PUT`	❌	✅
`DELETE`	❌	✅
`POST`	❌	❌
`PATCH`	❌	❌

GET, HEAD

For GET and HEAD requests, data is only read and no state is modified, so these operations are both safe and idempotent.

PUT

For a PUT request, the first invocation updates the resource (for instance, a customer's address). Subsequent identical requests overwrite the resource with the same data, so the state does not change any further. The operation is idempotent but not safe, since a write takes place.

DELETE

For a DELETE request, the first invocation deletes the resource and returns a 200 (or 204) response. Subsequent requests return a 404 response, but the state of the server does not change any further. The operation is idempotent but not safe, since a write takes place.

Important Note

When considering idempotence, it's crucial to distinguish between the result of an operation (the response returned to the client) and the side effects it produces. For example, with a DELETE operation, the response may vary (such as a 200 response code on the first request and a 404 on the second), but the server's state after the first request remains unchanged.
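The distinction between response and state can be sketched with an in-memory stand-in for a server. The `addresses` map and both functions are hypothetical; a real API would sit behind HTTP and a database.

```typescript
// In-memory stand-in for the server's state.
const addresses = new Map<string, string>();

// PUT: overwrites the resource, so repeating it leaves the same state.
function putAddress(id: string, address: string): { status: number } {
  addresses.set(id, address);
  return { status: 200 };
}

// DELETE: the first call removes the resource, later calls find nothing to remove.
function deleteAddress(id: string): { status: number } {
  return addresses.delete(id) ? { status: 204 } : { status: 404 };
}

putAddress("customer-1", "Main Street 1");
putAddress("customer-1", "Main Street 1");
const sizeAfterPuts = addresses.size; // still a single resource

const firstDelete = deleteAddress("customer-1");
const secondDelete = deleteAddress("customer-1");
```

The two DELETE calls return different status codes (204, then 404), yet the state after each of them is identical: the resource is gone.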

POST

For a POST request, invoking the endpoint N times creates N resources, changing the server's state each time. The operation is therefore neither safe nor idempotent by default.

PATCH

For a PATCH request, each invocation can potentially produce a different outcome. Take updating an address as an example: we can send something like { "address": "foobar" }, and invoking this N times causes no additional side effects. However, PATCH is much more general in the ways it can update a resource, for instance { "op": "move", "from": "/a/b/c", "path": "/a/b/d" }. Applied a second time, this patch no longer finds anything at /a/b/c, and other patch operations (such as appending an element to an array) change the state on every call. PATCH is therefore neither safe nor idempotent by default. If you want to read more about how to apply JSON patch, this resource is recommended for further reading.
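The behavior of the move patch can be sketched as follows. This is a deliberately minimal `move` written for this example only; it is not a full JSON Patch implementation, and the helper names are assumptions.

```typescript
type Json = { [key: string]: unknown };

// Resolves the containing object and final key for a pointer such as "/a/b/c".
function locate(doc: Json, pointer: string): { container: Json; key: string } {
  const parts = pointer.split("/").slice(1);
  const key = parts.pop() as string;
  let container: Json = doc;
  for (const part of parts) {
    container = container[part] as Json;
  }
  return { container, key };
}

// Minimal "move": fails when the source location does not exist.
function move(doc: Json, from: string, path: string): void {
  const source = locate(doc, from);
  if (!(source.key in source.container)) {
    throw new Error(`Nothing to move at ${from}`);
  }
  const value = source.container[source.key];
  delete source.container[source.key];
  const target = locate(doc, path);
  target.container[target.key] = value;
}

const doc: Json = { a: { b: { c: "value" } } };
move(doc, "/a/b/c", "/a/b/d"); // doc is now { a: { b: { d: "value" } } }

// A blind retry of the same patch does not behave like the first call.
let secondAttemptFailed = false;
try {
  move(doc, "/a/b/c", "/a/b/d");
} catch {
  secondAttemptFailed = true;
}
```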

Idempotent API Design

With that being said, we can now proceed to ensuring idempotency in API design. As previously mentioned, some operations do not require any additional work, such as a GET request to the bookings/{id} endpoint from the hotel booking example. However, for a POST request to the bookings endpoint, it is essential to ensure that multiple invocations of the same endpoint (as in a retry strategy) do not result in multiple bookings being placed. There are several solutions to this issue. For instance, we could assume that multiple requests with the same body within a short time span are duplicates of the same request. However, this assumption does not always hold. Consider someone who wishes to book multiple rooms in a hotel for a group trip with friends. They may legitimately send multiple requests to book a stay at the same hotel with the same properties.

{
  "propertyId": "id",
  "checkInDate": "2024-02-22T14:00:00.000Z",
  "checkOutDate": "2024-02-24T11:00:00.000Z"
}

My preferred approach is to attach a unique identifier to each call in the request headers. This identifier is generated on the client side. Requests sharing the same client request identifier can be recognized as duplicates and handled accordingly. The sequence diagram below visualizes this concept.

Sequence diagram
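The flow in the diagram can be sketched with in-memory stand-ins for the idempotency store and the bookings table. All names below are hypothetical and the logic is simplified: a real store would need an atomic conditional write, because a separate check followed by a write leaves a window in which two concurrent requests with the same key could both pass the check.

```typescript
interface BookingRequest {
  propertyId: string;
  checkInDate: string;
  checkOutDate: string;
}

// In-memory stand-ins for the idempotency store and the bookings table.
const responsesByKey = new Map<string, string>();
const bookings = new Map<string, BookingRequest>();
let nextId = 1;

// Returns the booking id; a repeated key returns the stored id
// without creating another booking.
function placeBookingOnce(idempotencyKey: string, request: BookingRequest): string {
  const previous = responsesByKey.get(idempotencyKey);
  if (previous !== undefined) {
    return previous;
  }
  const id = `booking-${nextId++}`;
  bookings.set(id, request);
  responsesByKey.set(idempotencyKey, id);
  return id;
}

const request: BookingRequest = {
  propertyId: "id",
  checkInDate: "2024-02-22T14:00:00.000Z",
  checkOutDate: "2024-02-24T11:00:00.000Z",
};

const first = placeBookingOnce("key-1", request);
const retried = placeBookingOnce("key-1", request); // same key: duplicate
const secondRoom = placeBookingOnce("key-2", request); // new key: new booking
```

Because the key comes from the client, two identical bodies can still be told apart: the retry reuses `key-1`, while the intentional second room uses `key-2`.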

Idempotent Handlers Design

The same approach works for handlers in distributed systems that pull messages from a queue. The key requirement is a unique identifier embedded in the message body, so that each message can be identified easily.

Sequence diagram for events
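For message handlers the idea looks roughly like the sketch below, which assumes at-least-once delivery and a message shape with a unique `id`. The in-memory set stands in for a durable store, and all names are hypothetical.

```typescript
interface PaymentMessage {
  id: string;
  amount: number;
}

// Stand-ins for a durable idempotency store and the real side effect.
const processedIds = new Set<string>();
const charges: PaymentMessage[] = [];

// Returns true when the message was processed, false when it was a duplicate.
function handleMessage(message: PaymentMessage): boolean {
  if (processedIds.has(message.id)) {
    return false; // duplicate delivery, skip the side effect
  }
  charges.push(message); // stand-in for charging the customer
  processedIds.add(message.id);
  return true;
}

const deliveries: PaymentMessage[] = [
  { id: "b-1", amount: 100 },
  { id: "b-2", amount: 250 },
  { id: "b-1", amount: 100 }, // the same message delivered again
];

const handled = deliveries.map((message) => handleMessage(message));
```

Three deliveries arrive, but only two charges are recorded.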

With the theory behind idempotency presented, we can now proceed to creating and designing an example application that demonstrates the concepts discussed in code.

Prerequisites

This is not an entry-level article; it's geared towards readers who have already created their first serverless applications. If you're new to serverless development, this article is recommended before starting. It provides a step-by-step guide and covers the fundamentals.

Application goals

In this tutorial, we will build an application based on the previously mentioned hotel booking example. Placing a booking will occur by accessing the POST /bookings route, while accessing the GET /bookings/{id} route will return booking details. Accepting a booking by the host will happen by accessing the PUT /bookings/{id}/accept route. In the background, after acceptance, payment will be processed.

Application architecture

The application is divided into two parts. The first part is a synchronous API that uses three services: Amazon API Gateway, AWS Lambda, and Amazon DynamoDB. The second part processes messages stored in an Amazon SQS queue asynchronously. This setup mimics a more complex distributed system and covers both types of systems discussed earlier: a synchronous API exposed to the user and an asynchronous distributed system that processes messages in the background. Idempotency is handled with Powertools for AWS Lambda, and DynamoDB serves as the state store that keeps track of idempotency keys. A diagram of the architecture is presented below.

Architecture diagram

Warning

The presented architecture is designed purely for educational purposes, and as such, it includes certain simplifications that result in flaws (for example lack of implementation of a Dead Letter Queue (DLQ) for the paymentProcessor lambda).

Helpful Tip

In a real-life scenario, I would consider using events instead of messages stored in a queue. This approach would make the system more flexible, as events can be consumed by multiple consumers, rather than just a single recipient. (Of course, Event-Driven versus Message-Driven architecture is a much more complex subject, on which I am planning to write a separate article).

Creating persistence layer

DynamoDB was chosen as the state store because of its ease of integration with Powertools for AWS Lambda. The table was created with the following configuration.

IdempotencyTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: ${self:custom.resourceSlug}-idempotency-table
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: id
          AttributeType: S
      KeySchema:
        - AttributeName: id
          KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: expiration
        Enabled: true

Helpful Tip

Take a look at the TimeToLiveSpecification property. This is a neat feature of DynamoDB that deletes records automatically once the timestamp stored in the given attribute (here, expiration) has passed. Here is the full documentation and explanation on how it works.
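DynamoDB expects the TTL attribute to be a Number holding a Unix epoch timestamp in seconds, and expired items may remain readable for a while before they are actually deleted. The sketch below shows how such a timestamp could be computed and checked; both helper names are hypothetical, and the one-hour lifetime is only an illustration.

```typescript
// DynamoDB TTL expects Unix epoch time in seconds, not milliseconds.
function expirationTimestamp(now: Date, ttlSeconds: number): number {
  return Math.floor(now.getTime() / 1000) + ttlSeconds;
}

// Expired items can still be returned by reads until DynamoDB removes them,
// so the expiration is worth checking on read as well.
function isExpired(expiration: number, now: Date): boolean {
  return expiration <= Math.floor(now.getTime() / 1000);
}

const createdAt = new Date("2024-02-27T12:00:00.000Z");
const expiration = expirationTimestamp(createdAt, 3600); // one hour later

const oneHourLater = new Date(createdAt.getTime() + 3600 * 1000);
console.log(isExpired(expiration, createdAt)); // false
console.log(isExpired(expiration, oneHourLater)); // true
```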

With the state store table set up in AWS, we can create its TypeScript representation.

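import { DynamoDBPersistenceLayer } from "@aws-lambda-powertools/idempotency/dynamodb";
import { BasePersistenceLayer } from "@aws-lambda-powertools/idempotency/persistence";

// checkForEnv is a helper from the project repository (not shown here); it is
// presumably there to fail fast when the environment variable is missing.
export const persistenceStore: BasePersistenceLayer =
  new DynamoDBPersistenceLayer({
    // Name of the idempotency table, injected through the function's environment
    tableName: checkForEnv(process.env.IDEMPOTENCY_TABLE),
  });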

Important Note

The detailed implementation of BookingTable and BookingRepository is omitted to keep the focus of the article on idempotency-related subjects. Please refer to the Conclusion section, where a link to the code repository is provided.

Creating placeBooking lambda function

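The first step is to define the Lambda function in the YAML file. Notice how the function is given permission to perform all necessary operations on the idempotency state store. Per-function iamRoleStatements, as used here, are typically provided by the serverless-iam-roles-per-function plugin.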

placeBooking:
  handler: src/functions/placeBooking/placeBooking.handler
  environment:
    BOOKING_TABLE: !Ref BookingTable
    IDEMPOTENCY_TABLE: !Ref IdempotencyTable
  events:
    - http:
        path: bookings
        method: POST
  iamRoleStatements:
    - Effect: Allow
      Action:
        - dynamodb:GetItem
        - dynamodb:PutItem
        - dynamodb:UpdateItem
        - dynamodb:DeleteItem
      Resource: !GetAtt IdempotencyTable.Arn
    - Effect: Allow
      Action:
        - dynamodb:PutItem
      Resource: !GetAtt BookingTable.Arn

This lambda function is responsible for handling the POST operation, which creates a booking request. As previously discussed, POST requests are not idempotent by default, so it's necessary to use a unique idempotent identifier to safely distinguish between original and duplicated requests. This is achieved by appending X-Idempotency-Key to the HTTP headers. If the key is not appended, the lambda will not proceed with further processing and will return a 400 status code along with an appropriate message. An example of code that accomplishes this is presented below.

if (!event.headers["X-Idempotency-Key"]) {
    throw new BadRequestException("Missing idempotency key");
}
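One caveat with the check above: HTTP header names are case-insensitive, and depending on the client and the API Gateway flavor the header may arrive as x-idempotency-key. A case-insensitive lookup could look like the following sketch; `getHeader` is a hypothetical helper and not part of the original code.

```typescript
// Hypothetical helper: compares header names in lowercase, because HTTP
// header names are case-insensitive.
function getHeader(
  headers: Record<string, string | undefined>,
  name: string
): string | undefined {
  const wanted = name.toLowerCase();
  for (const [key, value] of Object.entries(headers)) {
    if (key.toLowerCase() === wanted) {
      return value;
    }
  }
  return undefined;
}

const fromLowercase = getHeader({ "x-idempotency-key": "abc" }, "X-Idempotency-Key");
const fromExactCase = getHeader({ "X-Idempotency-Key": "abc" }, "X-Idempotency-Key");
const missing = getHeader({ "Content-Type": "application/json" }, "X-Idempotency-Key");
```

A JMESPath expression that selects the header by name matches the key literally, so if such a helper is used it probably makes sense to normalize the header name on the event before the event is handed to the idempotent function.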

Idempotency handler

As the first step in creating an idempotent handler, we need to specify the part of the event that will serve as the key. This is achieved by referencing the previously mentioned X-Idempotency-Key header and selecting it in the IdempotencyConfig. Next, we create a processingFunction responsible for implementing the actual business logic. Then, we ensure that this processingFunction is invoked only once by wrapping it with the makeIdempotent higher-order function, and providing the configuration and the previously created persistenceStore as parameters.

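import {
  IdempotencyConfig,
  makeIdempotent,
} from "@aws-lambda-powertools/idempotency";
import type { APIGatewayProxyEvent } from "aws-lambda";
import { v4 as uuid } from "uuid"; // assumption: uuid() below is the v4 generator from the uuid package

// BookingRepository, DynamoBookingRepository, PartialBooking, schema and
// BadRequestException come from the project repository and are not shown here.

// Use the X-Idempotency-Key request header as the idempotency key
const config = new IdempotencyConfig({
  eventKeyJmesPath: 'headers."X-Idempotency-Key"',
});

const bookingRepository: BookingRepository = new DynamoBookingRepository();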

const processingFunction = async ({ body }: APIGatewayProxyEvent) => {
  if (!body) {
    throw new BadRequestException("Missing body");
  }

  const booking: PartialBooking = JSON.parse(body);

  const { error } = schema.validate(booking);

  if (error) {
    throw new BadRequestException(error.message);
  }

  const id = uuid();

  await bookingRepository.create({
    ...booking,
    id,
    status: "REQUESTED",
  });

  return id;
};

const placeBooking = makeIdempotent(processingFunction, {
  persistenceStore,
  config,
});

Creation of the actual handler

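The final step is to combine all of these elements in the actual handler. Note that the Lambda context is registered with the configuration: because makeIdempotent wraps a function other than the handler itself, Powertools needs the context to know how much invocation time remains, so that a record left "in progress" by a timed-out invocation does not block later retries.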

export const handler: APIGatewayProxyHandler = httpMiddleware(
  async (event, context) => {
    if (!event.headers["X-Idempotency-Key"]) {
      throw new BadRequestException("Missing idempotency key");
    }

    config.registerLambdaContext(context);

    return placeBooking(event);
  },
  {
    successCode: HttpStatus.CREATED,
  }
);

Creating getBooking, acceptBooking lambda functions

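Neither of these Lambda functions requires additional changes to be idempotent, because their implementations honor the semantics of the HTTP verbs they are exposed under (GET and PUT): getBooking only reads, and acceptBooking always sets the same status. Keep in mind that the verb alone guarantees nothing; it is the handler that has to behave idempotently. A repeated call to acceptBooking does publish another message to the queue, which is one more reason the consumer of that queue has to be idempotent. The acceptBooking Lambda function is particularly interesting because it has two responsibilities: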

updating the item in the BookingRepository,
publishing a message to the SQS queue.

The code sample below demonstrates how this is achieved.

export const handler: APIGatewayProxyHandler = httpMiddleware(async (event) => {
  const id = getIdFromPathParams(event);

  // Checking if booking exists, if not findById will throw 404 error
  const { amount } = await bookingRepository.findById(id);

  try {
    await bookingRepository.updateStatus(id, "ACCEPTED");

    // await is required here: without it, a rejected publish would
    // bypass the catch block and the status would never be reverted
    return await messageQueue.publishMessage({ id, amount });
  } catch (err) {
    // Reverting to initial state in case of error
    await bookingRepository.updateStatus(id, "REQUESTED");

    // Throwing error so can be processed by middleware
    throw new Error((err as Error).message);
  }
});
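Compensation logic in a catch block only runs if the promise is awaited inside the try block. The self-contained sketch below demonstrates the difference; `failingPublish` is a hypothetical stand-in for a queue client whose publish call rejects.

```typescript
// Stand-in for a publish call that fails.
const failingPublish = async (): Promise<void> => {
  throw new Error("queue unavailable");
};

let compensations = 0;

// Without await, the rejected promise is returned as-is: the try block has
// already been left, so the catch block never runs.
async function withoutAwait(): Promise<void> {
  try {
    return failingPublish();
  } catch {
    compensations += 1;
  }
}

// With await, the rejection surfaces inside the try block, the compensation
// runs, and the error is rethrown to the caller.
async function withAwait(): Promise<void> {
  try {
    return await failingPublish();
  } catch (err) {
    compensations += 1;
    throw err;
  }
}
```

Both functions reject, but only `withAwait` increments `compensations`, which is why reverting a status after a failed publish depends on awaiting the publish call.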

Creating paymentProcessor lambda function

This Lambda function is invoked with messages pulled from the SQS queue and constitutes the asynchronous part of the system. As the first step, it must be defined in the YAML file so that it can be created on the AWS side. Again, the function is assigned all necessary IAM permissions for the state store. What's unique here is the events property, which connects the previously created SQS queue to the function as its trigger.

paymentProcessor:
  handler: src/functions/paymentProcessor/paymentProcessor.handler
  environment:
    BOOKING_TABLE: !Ref BookingTable
    IDEMPOTENCY_TABLE: !Ref IdempotencyTable
  events:
    - sqs:
        arn: !GetAtt PaymentProcessingQueue.Arn
  iamRoleStatements:
    - Effect: Allow
      Action:
        - dynamodb:GetItem
        - dynamodb:PutItem
        - dynamodb:UpdateItem
        - dynamodb:DeleteItem
      Resource: !GetAtt IdempotencyTable.Arn
    - Effect: Allow
      Action:
        - dynamodb:UpdateItem
      Resource: !GetAtt BookingTable.Arn

Idempotency handler

The actual Lambda implementation is nearly identical to the one presented earlier. The only difference is that no eventKeyJmesPath is specified: in that case the whole argument passed to the function (here, the parsed message body) serves as the idempotency key, which works because each message body contains a unique booking ID. Since this is just a tutorial, the third-party integration is mocked.

const config = new IdempotencyConfig({});

const bookingRepository: BookingRepository = new DynamoBookingRepository();

const processingFunction = async ({ id }: PaymentToBeProcessedMessage) => {
  // Here goes 3rd party payment API service call
  const paymentProcessedSuccessfully = true; // await paymentService.process(record)

  await bookingRepository.updateStatus(
    id,
    paymentProcessedSuccessfully ? "CONFIRMED" : "PAYMENT_DECLINED"
  );

  return paymentProcessedSuccessfully;
};

const handleBookingPayment = makeIdempotent(processingFunction, {
  persistenceStore,
  config,
});

export const handler: SQSHandler = async (event, context) => {
  config.registerLambdaContext(context);

  for (const record of event.Records) {
    await handleBookingPayment(JSON.parse(record.body));
  }
};
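In the handler above, a single failing record makes the whole invocation fail, so the entire batch is redelivered; idempotency is what makes that redelivery harmless. An alternative, not used in the original code, is to report only the failed messages back to SQS. The sketch below shows the shape of such a response with minimal local types. It assumes that the ReportBatchItemFailures response type is enabled on the SQS event source mapping, and `processBatch` is a hypothetical name.

```typescript
// Minimal local types mirroring the parts of the SQS event used here.
interface QueueRecord {
  messageId: string;
  body: string;
}

interface BatchResponse {
  batchItemFailures: { itemIdentifier: string }[];
}

// Processes records one by one and collects the ids of those that failed.
async function processBatch(
  records: QueueRecord[],
  processOne: (payload: unknown) => Promise<void>
): Promise<BatchResponse> {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of records) {
    try {
      await processOne(JSON.parse(record.body));
    } catch {
      // Only this message should be redelivered, not the whole batch.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```

Returning an empty `batchItemFailures` list signals that every message in the batch was handled successfully.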

Deploying and testing

Let's start the deployment to the test environment by running the yarn deploy:test command (feel free to use npm or pnpm).

Deployment

Let's first try placing a booking request. Notice how the returned id doesn't change when the X-Idempotency-Key is the same: this proves that the Lambda function detects the duplicated request and, instead of processing it a second time, returns the stored result.

Let's now try to get the placed booking to make sure that it was saved in the database.

Get booking

The final test is to accept the booking and check whether it was processed asynchronously by the paymentProcessor function.

Conclusion

In conclusion, constructing idempotent handlers is essential for creating reliable systems that are resilient to unexpected errors and retries. However, it's important to note that this tutorial serves as a general orientation and does not delve into all the hidden complexities of ensuring idempotency (such as late arriving requests from multiple clients). Additionally, the system built here is not intended to be implemented "as is" in a production environment; rather, it serves as a simple showcase of the available tools that make handling this case easier. I hope that this tutorial was helpful!

Link to the full code repository is here.