Cloud-native architecture

August 18, 2022 by Christopher Sherman

Cloud-native architecture is a methodology for building, running, and updating software products in a way that supports speed, safety, and scale. Rather than building a monolithic product on a dedicated physical device, we create smaller, modular services, known as microservices, and distribute them across physical devices. Services call one another over the network, and we combine them to create products.

Cloud-native systems were made possible by three key developments:

  1. High-speed internet reduced the cost of performing network calls in a distributed system.
  2. Virtualized servers increased utilization by right-sizing the resources allocated to each service’s workload, breaking the one-to-one relationship between a service and a physical device.
  3. Horizontal scaling across commodity devices alleviated the need for the expensive, specialty devices required to scale vertically.

Speed

Cloud-native architecture speeds development. Once our teams agree on the interface for a given service, we concurrently build the service and its clients, possibly in different languages and frameworks.
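
As a minimal sketch of agreeing on an interface first, assume a hypothetical inventory service with a single operation. The names InventoryService, FakeInventory, and can_fulfil are illustrative and do not come from any particular system. The client team codes against the agreed interface and tests with a stand-in while the service team builds the real implementation.

```python
from typing import Protocol


class InventoryService(Protocol):
    """Hypothetical interface both teams agree on before building anything."""

    def stock_level(self, sku: str) -> int:
        """Return the number of units available for the given SKU."""
        ...


class FakeInventory:
    """Stand-in used by the client team until the real service exists."""

    def __init__(self, levels: dict[str, int]) -> None:
        self._levels = levels

    def stock_level(self, sku: str) -> int:
        # Unknown SKUs are treated as out of stock.
        return self._levels.get(sku, 0)


def can_fulfil(inventory: InventoryService, sku: str, quantity: int) -> bool:
    """Client-side logic written only against the agreed interface."""
    return inventory.stock_level(sku) >= quantity
```

Because can_fulfil depends only on the interface, swapping FakeInventory for a real network client later should not require changing the client logic.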

Because cloud-native services tend to be smaller, the time required to build, test, and run an individual service is shorter. This reduces cognitive load on developers, reduces deployment time, and improves our ability to rapidly start a new instance of a service when we detect a failure.

Safety

Cloud-native architectures let us build services rapidly and concurrently while maintaining the stability, availability, and durability of each service and the overall system. Because we expose many of the services over HTTP, we cannot guarantee that every request receives a response, nor that a service will be available. To monitor the health of the system, we define conditions for failure, measure request patterns to establish a baseline, and alert when deviations occur. This practice is known as observability.
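
To make the baseline-and-alert idea concrete, here is a minimal sketch. It assumes request latency is the metric being watched and that "deviation" means more than a few standard deviations above the mean; both are illustrative choices, and the function names are hypothetical. Production monitoring systems typically use richer statistics and time windows.

```python
from statistics import mean, stdev


def build_baseline(latencies_ms: list[float]) -> tuple[float, float]:
    """Summarise normal behaviour as (mean, sample standard deviation)."""
    return mean(latencies_ms), stdev(latencies_ms)


def should_alert(observed_ms: float, baseline: tuple[float, float],
                 threshold: float = 3.0) -> bool:
    """Flag an observation more than `threshold` deviations above the mean."""
    avg, spread = baseline
    return observed_ms > avg + threshold * spread


# Hypothetical latencies collected while the service was healthy.
baseline = build_baseline([100.0, 110.0, 90.0, 105.0, 95.0])
```

With this baseline, a 110 ms request falls within normal variation, while a 200 ms request would trigger an alert.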

Each service should isolate its own failures and tolerate failures of the services it calls. Fault isolation means that when a service fails, it does not take down other services. Fault tolerance means gracefully handling failures of other services in a way that allows the failed service to recover and avoids triggering a cascade of failures in the system. Orchestration tools play an important role in monitoring for failures and attempting to recover to a healthy state.
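
One common way to tolerate a failing dependency is a circuit breaker, which is a technique this article does not name but which fits the description above: after repeated failures, the client stops calling the failed service for a while so it can recover. The sketch below is a simplified illustration; the class name, thresholds, and injectable clock are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when calls are being short-circuited."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Fail fast instead of adding load to a struggling service.
                raise CircuitOpenError("downstream service is resting")
            # Cool-down elapsed: let one trial call through; a single
            # further failure reopens the circuit.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # Any success resets the failure count.
        return result
```

A caller would wrap each outbound request, for example breaker.call(lambda: fetch_prices()), where fetch_prices is a hypothetical function, and fall back to a cached or default response when CircuitOpenError is raised.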

Scale

Vertical-scale architectures require capacity planning to have enough equipment to handle peak demand. Systems whose workloads are highly stable benefit from vertical-scale architecture due to its lower complexity. Architecting and managing a single production machine and a single redundant system is less complex than managing a fleet of commodity servers. Organizations that choose vertical-scale architecture typically own their infrastructure. By amortizing the investment over several years, they achieve a lower total cost of ownership compared to renting comparable infrastructure from a cloud provider. However, vertical-scale architecture leads to lower utilization during non-peak periods, as well as insufficient capacity if estimates of peak demand come in too low (e.g., for Black Friday or quarter-close).

Cloud-native architectures scale horizontally, expanding capacity by adding replicas of resources and load balancing workloads across commodity servers. Systems whose workloads burst and then fall back to periods of low activity benefit from cloud-native architecture due to its ability to add capacity on demand and its variable cost structure. Rather than owning the infrastructure, we rent it on demand from a cloud provider. For highly variable workloads, renting lowers costs compared to owning a dedicated server, since the cloud provider rents its infrastructure to other customers when we are not using it. To reduce costs further, we can commit to a minimum spend with the cloud provider based on our baseline workload. Cloud providers offer discounts to organizations willing to reserve specific capacity or commit to a minimum spend over a one-to-three-year time horizon.
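
The trade-off between owning for peak and renting on demand can be sketched with some simple arithmetic. All figures below are hypothetical: the hourly rates, the per-server capacity, and the demand profiles are made up for illustration and do not reflect any provider's pricing.

```python
import math


def owned_cost(hourly_demand: list[int], per_server_capacity: int,
               owned_rate: float) -> float:
    """Cost of owning enough servers for peak demand, paid every hour."""
    servers = math.ceil(max(hourly_demand) / per_server_capacity)
    return servers * owned_rate * len(hourly_demand)


def rented_cost(hourly_demand: list[int], per_server_capacity: int,
                rented_rate: float) -> float:
    """Cost of renting only the servers each hour's demand requires."""
    return sum(math.ceil(d / per_server_capacity) * rented_rate
               for d in hourly_demand)


# Hypothetical day: 23 quiet hours and one burst at ten times the load.
bursty = [100] * 23 + [1000]
# Hypothetical day with constant high load.
stable = [1000] * 24

# Assume renting costs twice as much per server-hour as owning.
bursty_owned = owned_cost(bursty, 100, owned_rate=1.0)
bursty_rented = rented_cost(bursty, 100, rented_rate=2.0)
stable_owned = owned_cost(stable, 100, owned_rate=1.0)
stable_rented = rented_cost(stable, 100, rented_rate=2.0)
```

Under these assumptions, renting is cheaper for the bursty profile even at double the hourly rate, while owning is cheaper for the stable profile.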

Horizontal scaling

Horizontal scaling requires rethinking service design. To elastically scale across devices, we no longer hold state within a particular server, because there is no guarantee the same server handles future requests. While some architectures implement sticky sessions, this design does not account for the higher incidence of failure in distributed systems and requires extensive capacity planning to evenly distribute load in the event of long-running sessions. A better option is to keep state in an external in-memory data store, such as Redis or Memcached, that every replica can reach.
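
The following sketch illustrates why moving state out of the server matters. The dict-backed InMemoryStore is only a stand-in for a shared store such as Redis or Memcached so that the example runs without a network; CartServer and its method names are hypothetical.

```python
from typing import Protocol


class SessionStore(Protocol):
    """Minimal interface a shared session store would need to offer."""

    def get(self, session_id: str) -> int: ...
    def set(self, session_id: str, value: int) -> None: ...


class InMemoryStore:
    """Stand-in for an external store; a real dict is NOT shared across machines."""

    def __init__(self) -> None:
        self._data: dict[str, int] = {}

    def get(self, session_id: str) -> int:
        return self._data.get(session_id, 0)

    def set(self, session_id: str, value: int) -> None:
        self._data[session_id] = value


class CartServer:
    """One replica of a stateless service; it keeps no session data itself."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self._store = store

    def add_item(self, session_id: str) -> int:
        # Read-modify-write is not atomic here; a real store would use an
        # atomic operation (for example, Redis INCR) to avoid lost updates.
        count = self._store.get(session_id) + 1
        self._store.set(session_id, count)
        return count


store = InMemoryStore()
replica_a = CartServer("a", store)
replica_b = CartServer("b", store)
```

If a load balancer sends a session's first request to replica_a and its second to replica_b, the item count still advances from 1 to 2, because neither replica owns the state.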

Hybrid cloud

As powerful, multi-core processors and dense, high-performance flash storage become commoditized, organizations have begun adopting a hybrid-cloud architecture. In a hybrid approach, an organization owns the infrastructure required to run its baseline workloads and handles bursts through an orchestration layer that seamlessly scales services onto infrastructure from a cloud provider. This approach relies on factors four and six through nine of the twelve-factor app methodology: backing services, processes, port binding, concurrency, and disposability. It enables an organization to exploit the lower cost of owning highly utilized infrastructure while maintaining the ability to scale when workloads burst.
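
The split between owned and rented capacity can be expressed as a small calculation. This is a sketch under simplifying assumptions: demand and per-replica capacity are measured in the same hypothetical unit (say, requests per second), and plan_capacity is an illustrative name, not part of any orchestration tool.

```python
import math


def plan_capacity(demand: int, per_replica_capacity: int,
                  owned_replicas: int) -> tuple[int, int]:
    """Return (owned replicas in use, cloud replicas to rent) for a demand level."""
    needed = math.ceil(demand / per_replica_capacity)
    on_premises = min(needed, owned_replicas)
    # Only the overflow beyond owned capacity bursts to the cloud.
    in_cloud = needed - on_premises
    return on_premises, in_cloud
```

With five owned replicas that each handle 100 units, a demand of 250 stays entirely on owned infrastructure, while a demand of 900 uses all five owned replicas and rents four more.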

Hybrid-cloud architecture requires taking a cloud-native approach to enable bursting to the public cloud. However, it differs from purely cloud-based architecture in that it typically does not adopt so-called serverless platforms: proprietary cloud platforms offered by particular cloud providers. Instead, hybrid cloud leverages an orchestration layer, such as Kubernetes, to initiate horizontally scaling the infrastructure into the public cloud during bursts. The orchestration layer also retires public cloud resources when demand abates. While the orchestration layer adds complexity, it helps avoid the vendor lock-in inherent in most serverless platforms. Whether the additional complexities of hybrid-cloud architecture justify the potentially lower costs and increased flexibility depends on the organization and product.
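
As an illustration of how an orchestration layer decides when to add or retire replicas, the Kubernetes Horizontal Pod Autoscaler documents a ratio-based rule: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The sketch below applies that rule with minimum and maximum bounds; the real autoscaler also applies a tolerance and stabilization behavior that are omitted here, and the function and parameter names are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 20) -> int:
    """Scale the replica count by the ratio of observed to target metric."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so the system neither scales to zero nor grows without bound.
    return max(min_replicas, min(max_replicas, raw))
```

For example, four replicas running at 90% CPU against a 60% target would scale out to six, and ten replicas at 20% against a 50% target would scale in to four.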

Conclusion

Cloud-native architecture offers improved speed, safety, and scale compared with monolithic, vertical-scale architecture. However, developing services that support the horizontal scaling inherent in cloud-native architecture is a source of added complexity. Before adopting cloud-native architecture, we should consider whether our product requires scaling beyond the capacity of a single server, as well as whether we can take advantage of the modularity of microservices.
