Distributed Systems 101 (based on Understanding Distributed Systems book)

Navigating the Complex Terrain of Distributed Systems: Exploring the Fundamentals, Communication Protocols, Scalability and Fault Tolerance

I recently finished reading "Understanding Distributed Systems", by Roberto Vitillo. In this article, I'll provide an overview of distributed systems, drawing from the knowledge and principles outlined in Vitillo's book and from my daily work as an SDE at Amazon.

Many of the topics approached here deserve a whole article of their own, so please keep in mind that this is an introduction to distributed systems. I will link other articles and references throughout the text, and list some next steps at the end, so you can dive deeper into each subject.

By the end, I hope you will have a clear overview of the most important aspects of distributed applications, as it is impossible to cover everything in a single article.

How do systems communicate with each other?

Nothing functions in isolation. Whether you're accessing a website or sending an email, almost every action on a computer involves communication with another system. Before digging into distributed systems, we will explore how this communication works.

Physical communication between services

Sometimes we forget about this, but something (data) physically travels from one computer to another, whether over WiFi or through Ethernet cables, for example. For this data exchange to happen, rules and formats were established; these rules and formats are called protocols.

Protocols act as the language of communication between systems, defining the rules and formats for data exchange. HTTP (Hypertext Transfer Protocol) governs communication over the web, allowing clients to request and receive resources from servers (Every time you open a webpage, you're using HTTP behind the scenes!). Similarly, RPC (Remote Procedure Call) enables programs on different computers to talk to each other by invoking procedures on remote machines.

But before data can travel using protocols, it needs an addressing system. This is where IP (Internet Protocol) addresses come in. An IP address is a unique identifier assigned to each device on a network, similar to a mailing address in the real world. When you request information from a website, your device uses its own IP address as the sender and the website's server IP address as the destination.
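
To make this concrete, here is a minimal sketch of the HTTP request/response cycle in Python, using only the standard library ("example.com" stands in for any web server):

    import http.client

    # Open a connection to the server and ask for a resource over HTTPS.
    conn = http.client.HTTPSConnection("example.com")
    conn.request("GET", "/")                 # the request, formatted per the HTTP protocol
    response = conn.getresponse()            # the server's answer, same protocol
    print(response.status, response.reason)  # e.g. "200 OK"
    conn.close()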

Digital Cartography: mapping services

Before moving on, I want to take a step back and talk about DNS, which is extremely important for communication between services. IP addresses are not very user-friendly; imagine having to remember a long string of numbers for every website you want to visit.

DNS solves this problem by providing a way to map human-readable domain names (like "google.com" or "instagram.com") to IP addresses. So, while protocols define the rules and formats for data exchange, and IP addresses provide the addressing system for devices on a network, DNS serves as the intermediary that maps human-readable domain names to machine-readable IP addresses, making it easier for us to navigate the vast landscape of the internet.
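
You can watch this mapping happen with a one-line lookup; a minimal sketch in Python (the resolved address varies by region and time):

    import socket

    # Resolve a human-readable domain name to the IP address machines actually use.
    ip_address = socket.gethostbyname("google.com")
    print(ip_address)  # e.g. "142.250.190.46"; the exact address will vary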

What happens when the communication rate increases?

Think of the systems you use daily, like Instagram, Amazon, Uber, Airbnb, and Steam: how do they guarantee resiliency? If Amazon's recommendation system goes offline, the rest of the website will probably keep working. How do they guarantee scalability? All of these companies serve millions of customers, so they need to be prepared for an insane amount of load. Do they have one big computer to process everything? Nope. Imagine if that computer suddenly stopped working.

Of course, these companies were not always this big. Over time, their systems started facing challenges with scalability, throughput, latency, and so on. And since there is a limit to how much you can scale memory, CPU, network, and disk on a single machine, you need another technique to scale the system. Distributed systems address this by employing techniques such as partitioning, replication, load balancing, and caching.

Instead of one enormous server (which would be unfeasible), these companies have many computers connected to each other over the network, appearing as a single computer to the user.

Here comes the need for distribution!

You have a system that is receiving millions of accesses; its infrastructure is already scaled, its performance is not as good as you would like, and you still need to support even more customers. What do you do? You distribute it.

Scalability: how do we scale?

There are several challenges when scaling a system: you want to minimize code changes, minimize the cost of doing so, and minimize single points of failure (i.e., places where, if a single part of your application stops, the whole application stops). To address these challenges as load and data volume increase, we split the application into pieces.

Services: the foundation

A service within a distributed system typically represents a specific domain or business capability. Each service operates independently, and if one service goes down, the rest of the application can continue running (other services won't be able to communicate with the offline service, that's for sure, but we will talk more about that later).

Communication: you have split the application into parts, but how will each part communicate with the others?

In a distributed architecture, effective communication between services is essential for the overall functionality of the system. One common approach to enable communication between services is through protocols such as HTTP, AMQP (Advanced Message Queuing Protocol), or gRPC (a high-performance RPC framework originally developed by Google). It is, literally, one service calling another.
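
As a sketch of what "one service calling another" looks like over plain HTTP, assume a hypothetical InventoryService exposing a REST endpoint (the URL and payload below are made up for illustration):

    import json
    import urllib.request

    # Hypothetical endpoint and payload of an InventoryService, for illustration only.
    payload = json.dumps({"product_id": 42, "quantity": 1}).encode("utf-8")
    request = urllib.request.Request(
        "http://inventory-service.internal/reserve",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The calling service blocks until InventoryService responds.
    with urllib.request.urlopen(request) as response:
        print(response.status, response.read().decode("utf-8"))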

Event-Driven Architecture

Using events to communicate with other services is another widely used approach. In an event-driven architecture, services communicate through the exchange of events or messages. When significant events occur within a service, such as the creation of a new user or the completion of a task, the service publishes a message to a component whose sole responsibility is to store those messages and deliver them to the services that actually process them.

Services interested in these messages can subscribe to a so-called message broker and react accordingly, enabling loosely coupled and scalable communication between services. You basically have an intermediary service that handles the communication. See the image below for a representation of an event-driven architecture.

Event-driven architecture example diagram
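
To make the broker idea concrete, here is a toy in-memory sketch of the publish/subscribe pattern (real brokers such as RabbitMQ or Kafka add persistence, retries, and delivery guarantees):

    from collections import defaultdict

    # A toy message broker: services subscribe to topics and are notified
    # when other services publish events to those topics.
    class MessageBroker:
        def __init__(self):
            self.subscribers = defaultdict(list)

        def subscribe(self, topic, handler):
            self.subscribers[topic].append(handler)

        def publish(self, topic, event):
            for handler in self.subscribers[topic]:
                handler(event)

    broker = MessageBroker()
    broker.subscribe("user.created", lambda e: print("EmailService: welcome,", e["name"]))
    broker.subscribe("user.created", lambda e: print("AnalyticsService: tracking", e["name"]))
    broker.publish("user.created", {"name": "Alice"})  # both subscribers react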

Horizontal Scaling: the foundation of scalability

You divided your application into pieces, but how exactly will it be able to support more customers? Is dividing it enough? No! You need to scale the application horizontally, by adding more machines or instances to the network and distributing the workload across multiple nodes. Although it might seem easier to scale vertically, doing so has a hard limit, a high cost, and disadvantages such as a single point of failure.

However, horizontal scaling also comes with its own challenges, such as ensuring data consistency across distributed nodes and orchestrating load balancing efficiently. As you might be wondering, proper design and implementation of horizontal scaling is essential for achieving scalability in distributed systems.

Load Balancing

Imagine that you have a service named InventoryService running on 10 machines. If another service wants to call it, how will it know which machine should process the request? The InventoryService's load balancer will decide that for you!

In a distributed system, load balancing plays a crucial role in distributing incoming requests among the machines efficiently. It ensures that the workload is evenly spread across all available machines, preventing some instances from becoming overwhelmed while others remain underutilized.

Taking the InventoryService as an example, the load balancer is another service, running on another machine, responsible for receiving the requests and deciding which of those 10 machines will end up processing each one. Oh wait! The load balancer can become a single point of failure: if it goes down, none of the requests will reach any of the 10 machines! How do we resolve that? Well, that is a topic for another article that I will be writing in the future. For now, let's understand how exactly it "balances" the load evenly.

  • Round Robin: this method assigns each incoming request to the next available server in a sequential order. It ensures that all servers receive an equal number of requests over time, effectively distributing the load evenly.

  • Least Connections: this algorithm directs incoming requests to the server with the fewest active connections at the time of the request. It aims to distribute the workload based on the current load of each server.

  • Weighted Round Robin: mainly used in scenarios where not all servers are equal in terms of capacity or performance, it assigns a weight to each server. Servers with higher weights receive a proportionally larger number of requests.

  • IP Hashing: determines which server will handle the request by using a hash function based on the client's IP address. Requests from the same client are consistently routed to the same server.

  • Least Response Time: the load balancer monitors the response times of each server and directs requests to the one with the fastest response time.

  • Dynamic Load Balancing: dynamically adjusts the balancing strategy based on real-time metrics such as server health, current load, or network conditions. This ensures efficient resource utilization and can handle sudden spikes in traffic more effectively.
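
To make the first two strategies concrete, here is a toy sketch of round robin and least connections in Python (real load balancers also handle health checks, timeouts, and connection draining):

    import itertools

    servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

    # Round robin: cycle through the servers in sequential order.
    round_robin = itertools.cycle(servers)
    for _ in range(5):
        print("route to", next(round_robin))

    # Least connections: pick the server with the fewest active connections.
    active_connections = {"10.0.0.1": 12, "10.0.0.2": 3, "10.0.0.3": 7}
    target = min(active_connections, key=active_connections.get)
    print("route to", target)  # "10.0.0.2"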

Below is an image showing a service running on three machines; the load balancer receives each request and decides which machine will process it.

Load balancer example diagram

Data Consistency and Replication

You now have a system running on multiple machines, as discussed in the horizontal scaling section. But what if this system stores data on those machines? An example of a system that does so is DynamoDB, a distributed database service. How do we guarantee that all the machines see the same data (i.e., that the data is consistent)? We use replicas, which are redundant copies of data in a system.

Replicating data involves maintaining copies of it across multiple replicas. Updates are propagated to all replicas to ensure consistency. However, ensuring consistency introduces challenges such as synchronizing and coordinating updates, because data changes over time. This brings up the issue of defining the level of consistency we want. Below, some of the most common levels are explained.

Strong consistency

In a service with strong consistency, every read receives the most recent write or an error. The system guarantees that all replicas will reflect the most recent write before any subsequent reads are allowed. This ensures that all subsequent reads will return the updated value.

Although this is the best consistency you can have in terms of reading the most recent data, it might hurt performance: for every write, all replicas must be updated before the data becomes available for reading. Strong consistency may also impose constraints on scalability, as the complexity of maintaining it grows with the number of replicas.

Example: imagine a chat application where users can send and receive messages in real-time. In a system with strong consistency, when a user sends a message, all replicas of the chat data across different servers must be immediately updated with the new message. Subsequent reads by any user should then reflect the most recent message sent, ensuring that all users see the same conversation at any given time.
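
As a toy model of the write path under strong consistency, the sketch below only acknowledges a write after every replica has applied it (real systems use consensus protocols such as Paxos or Raft rather than this naive loop):

    # Toy model: a write is acknowledged only after *all* replicas apply it.
    replicas = [{}, {}, {}]

    def write(key, value):
        for replica in replicas:
            replica[key] = value  # block until every replica has the value
        return "ack"              # only now is the write acknowledged

    def read(key):
        return replicas[0][key]   # any replica returns the latest value

    write("last_message", "hello!")
    print(read("last_message"))   # "hello!", never stale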

Sequential consistency

Sequential consistency is weaker than strong consistency but stronger than eventual consistency. It guarantees that all processes observe operations in the same order; however, unlike strong consistency, that agreed-upon order does not have to match the real-time order in which the operations actually happened.

Example: imagine a banking system in which two customers, Alice and Bob, initiate withdrawals concurrently from their joint account. Although their withdrawal requests are processed by different servers, sequential consistency demands that all servers agree on the order in which these transactions occurred. If Alice's withdrawal is processed before Bob's, this order must be observed consistently across all servers.

However, due to propagation delays, account balance updates might lag across replicas, leading to temporary discrepancies. Despite these delays, sequential consistency ensures that eventually, all replicas converge to a state where the order of transactions is consistent across the system, preserving the chronological sequence of events.

Eventual consistency

This is the weakest form of consistency among the three. Eventual consistency allows for temporary inconsistencies among replicas; however, there is a guarantee that, given enough time, all replicas will eventually converge to a consistent state.

This approach assumes that achieving immediate consistency across distributed systems can be impractical or inefficient due to factors such as network latency and the volume of data being replicated. Eventual consistency prioritizes availability over immediate consistency.

This model is commonly employed in distributed databases like DynamoDB, content delivery networks, and other distributed systems where trade-offs between consistency, availability, and performance need to be carefully balanced to meet the requirements.

Example: imagine a social media platform where users can post updates. After a user posts an update, different replicas across servers might not immediately reflect the new post. However, as the update propagates through the system, all replicas will eventually converge to a consistent state where every user sees the posted update.

In practice, the user may open the platform, see the post, refresh the page, stop seeing it, then refresh the page again and start seeing the post again. This can happen until all replicas have received the post update and become consistent.
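
Here is a matching toy model for eventual consistency: the write is acknowledged after updating a single replica, the others catch up later, and reads may hit a stale replica in the meantime:

    import random

    # Toy model: a write is acknowledged after updating one replica;
    # the others catch up later via background propagation.
    replicas = [{}, {}, {}]

    def write(key, value):
        replicas[0][key] = value  # acknowledge immediately
        return "ack"              # other replicas are updated asynchronously

    def propagate():              # runs in the background in a real system
        for replica in replicas[1:]:
            replica.update(replicas[0])

    def read(key):
        return random.choice(replicas).get(key)  # may hit a stale replica

    write("post", "hello world")
    print(read("post"))   # may print "hello world" or None (stale replica)
    propagate()
    print(read("post"))   # now always "hello world"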


Replication and consistency are important topics in distributed systems (and also hard to understand on a first read). Replicating data provides many benefits, like keeping data geographically close to users to minimize latency, and increasing availability in case certain machines stop working.

Partitioning

While replication involves creating identical copies of data across multiple machines, with each machine storing a complete copy of the dataset, partitioning involves dividing the dataset into smaller subsets and distributing these subsets across different machines. Each machine stores only a portion of the overall dataset.

Why do we need partitioning?

Partitioning enables systems to handle larger datasets and higher workloads by distributing the data across multiple machines. This allows for better resource utilization and accommodates growing demands without overwhelming individual machines.

Partitioning also enhances fault tolerance by reducing the impact of individual machine failures. Even if one machine goes down, the system can continue to function using the data stored on the remaining machines. Additionally, partitioning often includes replication within partitions, ensuring that each subset of data is redundantly stored across multiple machines.
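
A minimal sketch of one common scheme, hash partitioning, where a hash of the key decides which machine owns it (real systems usually prefer consistent hashing, so that adding a machine does not reshuffle every key):

    import hashlib

    machines = ["machine-0", "machine-1", "machine-2"]

    def partition_for(key):
        # Hash the key and map the result onto one of the machines.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return machines[int(digest, 16) % len(machines)]

    for user_id in ["alice", "bob", "carol"]:
        print(user_id, "->", partition_for(user_id))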

Availability: how to stay up and running?

Availability is understood as the percentage of time the system is up and operational, and it is typically measured in the number of 9s: 1 nine = 90% availability, 2 nines = 99%, 3 nines = 99.9%, 4 nines = 99.99%, and so on.
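
The nines translate directly into how much downtime is allowed; here is a quick sketch of the arithmetic:

    MINUTES_PER_YEAR = 365 * 24 * 60

    for nines in range(1, 5):
        availability = 1 - 10 ** -nines  # 0.9, 0.99, 0.999, 0.9999
        downtime = MINUTES_PER_YEAR * (1 - availability)
        print(f"{nines} nine(s): {availability:.4%} up, "
              f"~{downtime:,.0f} minutes of downtime per year")

Four nines, for example, leave you only about 53 minutes of downtime per year.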

Availability in distributed systems involves not only ensuring that individual components remain operational; it extends to the ability of the entire system to continue providing seamless and uninterrupted service to users. There are many mechanisms used to achieve high availability, such as fault tolerance mechanisms, replication (which we also talked about previously), and much more.

Fault Tolerance

100% availability is physically impossible; that is why we apply fault tolerance and recovery mechanisms, since it is not a matter of "if" but "when" the system will experience a failure. We can't prevent failures entirely, but we can design systems that are resilient and recover quickly. Below, some of the techniques for doing so are explored.

Replication (again?)

Yep, again! Replication helps achieve a high availability rate by providing redundancy, ensuring that if one replica fails or becomes unavailable, the others can continue to serve requests. This improves the system's reliability and availability.

The load balancer component that we discussed earlier is also very important here, as it enhances availability by ensuring that no replica becomes overwhelmed with requests.

Failure detection and Recovery

There are many ways to detect failures. A very common one is having the service periodically send signals to indicate that it is alive; these are called heartbeat signals. Missing signals indicate a potential failure.

Services should also monitor and track metrics like CPU usage and memory availability. This helps identify issues before they erupt into complete failures.

Recovery: there are several techniques that can be used for recovery, and this step can also be automated depending on the scenario, with backup servers for example. If automatic recovery is not possible, the system should trigger an alert for manual intervention.

Example: an e-commerce website uses a distributed system with multiple servers handling customer requests. A failure detection and recovery system continuously monitors these servers. If a server shows high CPU usage or stops sending heartbeats, the system detects the failure. It then isolates the failing server and initiates a failover, automatically switching traffic to a healthy server. This ensures that customer experience is minimally impacted, and purchases can continue uninterrupted.
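
A minimal sketch of heartbeat-based failure detection: each server records when it last reported in, and a monitor flags any server that has been silent for too long (real detectors also account for network jitter and use adaptive timeouts):

    import time

    HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a server is suspected

    last_heartbeat = {}      # server -> timestamp of last heartbeat received

    def record_heartbeat(server):
        last_heartbeat[server] = time.monotonic()

    def suspected_failures():
        now = time.monotonic()
        return [s for s, t in last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]

    record_heartbeat("inventory-1")
    record_heartbeat("inventory-2")
    # Prints [] now; ["inventory-2"] after 5 seconds of silence from it.
    print(suspected_failures())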

Graceful Degradation

Graceful degradation is a strategy employed to maintain essential functionalities even when certain components experience failures or become unavailable. Unlike failure detection and recovery mechanisms, graceful degradation emphasizes maintaining a baseline level of service quality even under adverse conditions. Also, it is important to note that both can and should be used together.

As an example of graceful degradation helping to provide high availability, imagine an online learning platform: if a video lecture experiences buffering issues due to high traffic, the platform can offer alternative delivery methods, like downloadable transcripts or audio-only versions, to ensure students can still access the learning material.
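
The pattern often boils down to "try the full experience, then fall back to a cheaper one". Here is a minimal sketch with hypothetical function names:

    # Hypothetical delivery functions; the fallback chain is the point.
    def serve_video(lecture_id):
        raise TimeoutError("video servers overloaded")  # simulate high traffic

    def serve_audio(lecture_id):
        return f"audio stream for lecture {lecture_id}"

    def serve_transcript(lecture_id):
        return f"transcript for lecture {lecture_id}"

    def serve_lecture(lecture_id):
        for delivery in (serve_video, serve_audio, serve_transcript):
            try:
                return delivery(lecture_id)
            except TimeoutError:
                continue  # degrade to the next, cheaper option
        return "lecture temporarily unavailable"

    print(serve_lecture(42))  # falls back to the audio stream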

Monitoring: ensuring visibility in distributed systems

Maintaining visibility into performance, health, and behavior is essential for ensuring reliability and timely issue resolution. Monitoring plays an important role in achieving this objective, offering approaches to gain insight into system behavior and performance.

Metrics

Metrics are quantitative measurements that capture different facets of system behavior and performance. In the context of distributed applications, the main metrics include:

  • Latency: measuring the time it takes for requests to be processed or responses to be delivered.

  • Throughput: tracking the rate at which requests are handled by the system.

  • Error Rates: monitoring the frequency of errors or failures occurring within the application.

  • Resource Utilization: assessing the usage of CPU, memory, disk space, and other resources.
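
As a toy illustration of how a service might record the first three of these around each request (production systems typically use dedicated clients such as Prometheus or StatsD exporters rather than a plain dictionary):

    import time

    metrics = {"requests": 0, "errors": 0, "total_latency": 0.0}

    def handle_request(work):
        start = time.monotonic()
        metrics["requests"] += 1          # throughput: count every request
        try:
            return work()
        except Exception:
            metrics["errors"] += 1        # error rate: count failures
            raise
        finally:
            metrics["total_latency"] += time.monotonic() - start  # latency

    handle_request(lambda: sum(range(1_000_000)))
    avg_latency = metrics["total_latency"] / metrics["requests"]
    print(f"{metrics['requests']} requests, avg latency {avg_latency:.4f}s")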

Dashboards

Dashboards provide visual representations of metrics, allowing stakeholders to monitor the health and performance of distributed applications in real-time. Key characteristics of effective dashboards include:

  • Aggregation: aggregating metrics from multiple sources and components of the distributed system to provide a comprehensive view of its health.

  • Visualization: utilizing charts, graphs, and other visualization techniques to present metrics in an intuitive and actionable manner.

  • Alerting: integrating alerting mechanisms into dashboards to notify stakeholders when predefined thresholds or anomalies are detected.

Logs

Logs play a crucial role in capturing detailed information about system events, errors, and transactions, helping in debugging and troubleshooting efforts. Key aspects of logs include:

  • Event Logging: recording chronological records of system events, errors, and transactions, offering a timeline of system activity.

  • Error Tracking: identifying and logging errors or exceptions encountered within the application, assisting in diagnosing and resolving issues.

  • Transaction Monitoring: tracking the flow of transactions through the system, enabling the reconstruction of transaction paths for analysis and auditing purposes.


A robust monitoring strategy for a distributed system incorporates the use of metrics, dashboards, and logs to ensure comprehensive visibility into system performance and behavior, facilitating proactive monitoring, troubleshooting, and optimization efforts.

What's next?

We have come to the end of this article, but there is much more to study on distributed systems. Here are some next steps, books and other articles for you to read. Also, feel free to leave any comments or suggestions!