API Gateway Responsibilities

October 1, 2019 by Chris Sherman

An API gateway sits between clients and services. While there is no precise definition of what constitutes an API gateway, functions an API gateway is responsible for typically fall into three categories: routing, aggregation, and cross-cutting functionality. These functions are applicable to many backing services, so having the gateway take responsibility for their implementation yields more focused services and interface consistency.


In their role as routers, gateways provide a single endpoint for clients to consume. When the gateway receives a request, it forwards that request on to one or more services. This decouples the task to be done from the services that accomplish that task. If the services used to accomplish the task change, clients do not necessarily have to change the request.

Routing also provides flexibility for introducing new functionality. When a service deploys a new version, we can route requests to the new version for only a subset of clients. Assuming the partial rollout goes well, we can subsequently roll out the new version to everyone.
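
As a rough illustration of a partial rollout, the sketch below hashes a client identifier into one of 100 buckets and routes the lowest buckets to the new version. The upstream addresses, the function names, and the choice of hashing on a client ID are assumptions made for this example, not part of any particular gateway product.

```python
import hashlib

# Hypothetical upstream addresses; a real gateway would load these from configuration.
STABLE_UPSTREAM = "http://orders-v1.internal"
CANARY_UPSTREAM = "http://orders-v2.internal"


def bucket_for(client_id: str) -> int:
    """Map a client ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_upstream(client_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of clients to the new version."""
    if bucket_for(client_id) < canary_percent:
        return CANARY_UPSTREAM
    return STABLE_UPSTREAM
```

Because the bucket depends only on the client ID, a given client keeps hitting the same version for as long as the percentage stays the same, and raising the percentage to 100 completes the rollout.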

Keep in mind that routing all traffic to the gateway introduces a single point of failure. To mitigate this risk, it’s important to design the gateway for resiliency. Resiliency involves maintaining the availability of the gateway in the face of both well-intentioned and nefarious requests. Whether it’s an avalanche of legitimate requests or a denial-of-service attack, many of the same strategies apply. These strategies include authentication and authorization, IP whitelists, caching, and rate limiting.
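
Rate limiting can take many forms. One common approach is a token bucket, sketched below under the assumption of a single-process gateway; the class name, its parameters, and the injectable clock are illustrative choices rather than a reference implementation.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A gateway would typically keep one bucket per client or API key and reject a request, for example with HTTP status 429, when `allow()` returns `False`.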

Performance, or how long it takes the gateway to respond to a request, is closely related to resiliency. Assuming clients communicate with the gateway via HTTP, there is a threshold within which the gateway must send a response to prevent clients from timing out the request. To keep the gateway resilient and performant, the code we execute on the gateway should be short-lived.

To keep execution time short, a gateway often communicates with services asynchronously. This allows the gateway to handle other requests while it waits for responses. A common implementation of this paradigm is the event loop concurrency pattern. The event loop processes requests on a single thread by offloading the work to be done via asynchronous service calls. While the event loop waits for the service calls to complete, it processes other requests.
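
The event loop pattern can be sketched with Python's `asyncio`. Here `call_service` is a hypothetical stand-in for a non-blocking call to a backing service, simulated with `asyncio.sleep`; the service name and delay are made up for the example.

```python
import asyncio


async def call_service(name: str, delay: float) -> str:
    """Stand-in for a non-blocking call to a backing service (hypothetical)."""
    await asyncio.sleep(delay)  # the event loop is free to run other work here
    return f"{name}: ok"


async def handle_request(request_id: int) -> str:
    result = await call_service("inventory", 0.05)
    return f"request {request_id} -> {result}"


async def main() -> list[str]:
    # Ten requests are in flight at once on a single thread.
    return await asyncio.gather(*(handle_request(i) for i in range(10)))


if __name__ == "__main__":
    for line in asyncio.run(main()):
        print(line)
```

Each request waits on its own service call, but because the waits overlap, the ten requests finish in roughly the time of one call rather than ten.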

To ensure high availability, Microsoft recommends deploying at least two replicas of the gateway. From there, we can scale out the gateway further based on load. It can also make sense to run the gateway on dedicated nodes in a clustered environment to prevent noisy-neighbor problems.

Despite implementing strategies for maintaining resiliency and performance, we may still choose to partition the public interface into multiple gateways. Partitioning can help organize gateway responsibilities from a logical perspective. Partitioning the gateway by API version, particular endpoints, or service criticality are common strategies. Another partitioning strategy is separating the interface based on the types of clients being served.


The average human reaction time is 250 milliseconds (a quarter of a second). Actions performed in less than 250 milliseconds appear instantaneous. For a browsing experience to feel instantaneous, reducing round-trip time is a leading consideration. When it comes to round-trip time, the main contributing factor is typically latency, i.e. how long it takes a request to travel to the server and the response to travel back.

In 2012, the average round-trip time for a single Google request was 100 milliseconds. Many web pages require more than a single request. The more requests required to render a webpage, the greater the aggregate latency. Yes, the browser can parallelize some requests, but there is also an overhead cost to parallelization. We may choose to aggregate requests when a unit of work the client wants performed is not handled by a single backing service. By aggregating the unit of work into a single request to the gateway, we can reduce latency, thereby providing a better browsing experience.

Note: Aggregation is not the same as request batching. Request batching reduces the number of requests between a client and a single service across multiple units of work. Aggregation reduces the number of requests required to complete a single unit of work.
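
A minimal sketch of aggregation, assuming a unit of work that needs data from two services: the gateway fans out to both concurrently and returns one merged response. The service functions, field names, and returned values are hypothetical placeholders.

```python
import asyncio


# Hypothetical service calls; real ones would issue non-blocking HTTP requests.
async def fetch_profile(user_id: int) -> dict:
    await asyncio.sleep(0.01)
    return {"id": user_id, "name": "Ada"}


async def fetch_orders(user_id: int) -> list:
    await asyncio.sleep(0.01)
    return [{"order_id": 1, "total": 42.0}]


async def get_account_page(user_id: int) -> dict:
    """One client request fans out to two services and returns a merged response."""
    profile, orders = await asyncio.gather(fetch_profile(user_id), fetch_orders(user_id))
    return {"profile": profile, "orders": orders}
```

The client makes a single round trip to the gateway, and the two service calls happen over the network between the gateway and the backing services.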

Without a gateway, clients send requests directly to each service. In addition to increased latency, sending requests directly to each service exposes potential problems such as:

  • Complex code: clients must track multiple types of endpoints and handle failures in a resilient way.
  • Coupling: client requests may require multiple services to complete a given task. Without a gateway, clients must have knowledge of individual services in order to make the proper calls. If we later decompose or aggregate services, this can cause disruption to clients.
  • Limited communication protocols: services must expose themselves via commonly used communication protocols consumable by clients.
  • Security: each public endpoint increases the potential attack surface. Responsibility for hardening publicly exposed endpoints gets spread across services.

A key to the gateway as an aggregator is the implicit assumption it can aggregate requests more efficiently than the client. For the gateway to efficiently perform this function, we can implement the following resiliency strategies:

  • Bulkhead: A ship’s hull has bulkheads. In the event the hull becomes compromised, these bulkheads ensure only the damaged section of a hull fills with water, preventing the ship from sinking. In a microservice architecture, it’s possible for excessive load or failure of a single service to cause a cascade of failures in other services. To sustain partial functionality in the event of a service failure, we can partition services based on load and availability requirements. Technologies such as Kubernetes offer the ability to specify CPU and memory limits on a container-by-container basis.
  • Circuit Breaker: a circuit breaker monitors the number of failures over a given period and decides whether to pass requests through to the underlying service or immediately return an exception. This prevents clients from overwhelming a service while that service is in a transient failure state. The circuit breaker has three states:
    1. Closed: passes requests through and monitors failures. If the failure threshold is exceeded over a given period, the circuit opens and starts a timeout timer. The timeout timer gives the service a grace period to attempt to recover from the failure.
    2. Open: requests fail immediately.
    3. Half-open: once the failure timeout expires, the circuit allows a limited number of requests to pass through. If any request fails, the circuit breaker switches back to the open position because it assumes the failure is still present. If the requests succeed, the circuit breaker switches to the closed position and begins monitoring failures with a fresh failure threshold.
  • Retry: when the client experiences a failure that it expects is short-lived, it can implement an automated retry operation. The retry can be immediate or it can be delayed. For delayed retries, it may choose to increase the delay between retries and completely fail the attempt after it experiences a predefined threshold of failures. Ideally the client implementing the retry logic will understand the nature of the failure and only retry for failures known to be transient. Retrying non-transient failures potentially causes further service degradation.
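
The circuit breaker states and the delayed retry described above can be sketched as follows. This is a simplified illustration under stated assumptions: the breaker counts consecutive failures rather than failures over a time window, it opens once the count reaches the threshold, and the half-open state admits a single trial request. All names and default values are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            self.state = "half-open"  # timeout expired: let a trial request through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success closes the circuit and resets the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial request, or too many failures, opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()


def retry(func, attempts=3, base_delay=0.1, transient=(TimeoutError,), sleep=time.sleep):
    """Retry only failures listed as transient, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return func()
        except transient:
            if attempt == attempts - 1:
                raise  # threshold reached: fail the attempt completely
            sleep(base_delay * 2 ** attempt)
```

Restricting `transient` to specific exception types reflects the advice to retry only failures known to be short-lived; any other exception propagates immediately.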

Additional recommendations for resiliency:

  • Locate the gateway near the backend services to limit latency as much as possible.
  • Use asynchronous requests to backing services to ensure a delay in the backend doesn’t cause performance issues at the gateway.
  • Instead of performing aggregation in the gateway, create an aggregation service behind the gateway. Request aggregation may have higher resource requirements compared to other gateway functions such as routing.
  • Time out service calls that take too long, potentially returning a partial set of data.
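
The last recommendation, timing out slow calls and returning partial data, might look like the sketch below. The service functions, the 0.2-second budget, and the use of `None` to mark missing data are assumptions made for illustration.

```python
import asyncio


async def slow_reviews() -> list:
    await asyncio.sleep(1.0)  # hypothetical service that is too slow
    return ["great product"]


async def fast_price() -> float:
    await asyncio.sleep(0.01)
    return 19.99


async def with_timeout(coro, seconds: float, default=None):
    """Return the call's result, or `default` if it exceeds the time budget."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return default


async def product_page() -> dict:
    price, reviews = await asyncio.gather(
        with_timeout(fast_price(), 0.2),
        with_timeout(slow_reviews(), 0.2),
    )
    return {"price": price, "reviews": reviews}  # reviews is None on timeout
```

The client still receives the price within the time budget, and it can decide how to render the page without the reviews.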


To simplify application development, we can offload cross-cutting functionality into the gateway. Security issues such as token validation, encryption, and SSL certificate management require specialized skills. Almost all services need functions such as authentication, authorization, logging, and monitoring. Some of these functions are not easily packaged and configured as dependencies, so it may be better to consolidate them into the gateway to reduce overhead and the chance for errors.

Terminating inbound SSL connections is a common function of the gateway. This pattern keeps data encrypted between the client and the gateway while allowing unencrypted traffic to flow between internal services, which removes the need to distribute and maintain certificates between backing services. The core engineering team can focus on application features, and security experts no longer need to address authentication, authorization, and network monitoring at every level of the architecture.

Offloading functions such as logging and monitoring to the gateway provides a level of consistency. Even if an individual service is not properly instrumented, the gateway ensures we have a minimum level of logs available. The gateway can also take care of more specialized monitoring activities such as rate limiting.

Additional functions commonly handled by the gateway include:

  • Response caching.
  • GZIP compression.
  • Serving static content.
  • Protocol conversion.

Offloading functionality to the gateway is a balancing act. As discussed in the routing section, we must ensure the gateway maintains a reasonable level of performance and is resilient to failure. Practical recommendations for offloading include:

  • Only offload features used by the entire application. Limiting the gateway to cross-cutting concerns reduces the risk for myriad, long-running functions that cause the gateway to become a bottleneck.
  • Never offload business logic to the gateway. Business logic is the responsibility of the service accomplishing a given task.
  • To track transactions from the gateway to the services doing the work, generate a correlation ID. Attach the ID as a custom HTTP header so that services can include it in the events they log.
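
A minimal sketch of the correlation ID recommendation: the gateway adds an ID to the headers it forwards, keeping any ID the caller already supplied. The header name `X-Correlation-ID` and the function name are assumptions; any custom header the services agree on would work.

```python
import uuid

# Hypothetical header name; any custom header agreed on by the services would work.
CORRELATION_HEADER = "X-Correlation-ID"


def with_correlation_id(headers: dict) -> dict:
    """Return a copy of the headers that carries a correlation ID.

    An ID supplied by the caller is preserved; otherwise a new one is generated.
    """
    forwarded = dict(headers)
    forwarded.setdefault(CORRELATION_HEADER, str(uuid.uuid4()))
    return forwarded
```

HTTP header names are case-insensitive, while this sketch uses a plain case-sensitive dictionary, so a real gateway would normalize header names before the lookup.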


API gateways play a critical role in a microservices architecture, acting as a mediator between clients and services. While there is no one-size-fits-all approach for which responsibilities a gateway handles, at a high level gateways handle routing, request aggregation, and cross-cutting concerns. Because gateways act as the single interface for client requests, it’s critical to ensure an acceptable level of performance as well as resiliency to backing service failures.


Microsoft. (2018, October 22). Using API gateways in microservices. Retrieved from https://docs.microsoft.com/en-us/azure/architecture/microservices/design/gateway

Microsoft. (2017, June 22). Gateway Routing pattern. Retrieved from https://docs.microsoft.com/en-us/azure/architecture/patterns/gateway-routing

Microsoft. (2017, June 22). Gateway Aggregation pattern. Retrieved from https://docs.microsoft.com/en-us/azure/architecture/patterns/gateway-aggregation

Microsoft. (2017, June 22). Gateway Offloading pattern. Retrieved from https://docs.microsoft.com/en-us/azure/architecture/patterns/gateway-offloading

Microsoft. (2017, June 22). Bulkhead pattern. Retrieved from https://docs.microsoft.com/en-us/azure/architecture/patterns/bulkhead

Microsoft. (2017, June 22). Circuit Breaker pattern. Retrieved from https://docs.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker

Microsoft. (2017, June 22). Retry pattern. Retrieved from https://docs.microsoft.com/en-us/azure/architecture/patterns/retry

PubNub Staff. (2015, February 9). How Fast is Realtime? Human Perception and Technology. Retrieved from https://www.pubnub.com/blog/how-fast-is-realtime-human-perception-and-technology/

Grigorik, Ilya. (2012, July 19). Latency: The New Web Performance Bottleneck. Retrieved from https://www.igvita.com/2012/07/19/latency-the-new-web-performance-bottleneck/

KeyCDN. (2018, October 4). What Is Latency and How to Reduce It. Retrieved from https://www.keycdn.com/support/what-is-latency
