CAP Theorem in System Design

In this episode, I will explain the CAP Theorem in System Design. You will learn everything you need to know about Consistency, Availability, and Partition Tolerance, as well as the trade-offs between CP, CA, and AP systems.

Hello 👋

Welcome to another week, another opportunity to become a Great Backend Engineer.

Today’s issue is brought to you by Masteringbackend → A great resource for backend engineers. We offer next-level backend engineering training and exclusive resources.

Before we get down to today's business: this is Part 13 of Understanding System Design, the last chapter.

I have a special announcement for you, and you will love this one.

Finally, I’m starting my Advanced Backend Bootcamp 2.0. My team and I have been preparing for this for over a year, since I launched the first one.

This bootcamp is different: it’s a 4-6 month in-depth program that dives deep into backend engineering and has you build real-world backend projects.

I won’t say too much here, because I’m fully prepared to turn you into a great backend engineer.

Click here for all the information you need.

Introducing GBE Podcast

I am introducing my new podcast, The “Great Backend Engineer (GBE)”.

To finish the database discussion, I invited Raul Junco to my Podcast to discuss the importance of databases to backend engineers.

Listen to the Podcast.

Now, back to the business of today.

In previous editions, we explored different topics under Database, such as understanding databases, ACID Compliance, database replication, database sharding, and database indexing.

In this episode, I will explain the CAP Theorem in System Design. You will learn everything you need to know about Consistency, Availability, and Partition Tolerance, which will conclude our System Design series.

This ends the System Design series. However, the Advanced System Design Course for Backend Engineers is launching soon.

Overview of CAP Theorem

Trade-offs are important in software engineering. You can't get everything right.

No one has.

Whether you're a senior engineer or CTO, you must understand that no system is perfect, and no software is without fault.

That's why Eric Brewer introduced the CAP Theorem in 2000; it gives you a way to think about the trade-offs involved in designing and building distributed systems.

Let’s take an example:

Have you ever seen an advertisement for services like graphic design or house painting that begins with a catchy headline: “Cheap, Fast, and Good: Pick Two”?

Cheap represents the financial aspects of the project.

Fast represents the speed at which the tradesperson completes the project, and Good signifies craftsmanship and excellence.

The tradesperson says it’s unreasonable for a client to expect all three. There is a trade-off, and the client should choose what to forgo:

  • Fast and good = expensive

  • Fast and cheap = low quality

  • Good and cheap = takes time

So you have to make a choice.

The trade-off you face in the tradesperson analogy is the same one the CAP theorem describes for distributed systems.

What is the CAP Theorem?

The CAP theorem, also known as Brewer's theorem, provides a way of considering trade-offs when designing and building a distributed system. It states that a distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition Tolerance.

From our analogy above, a trade-off must happen. Therefore, you must choose between different properties of the CAP theorem in your system design.

Properties of the CAP Theorem

A distributed system is made up of multiple nodes that store data, and it can deliver only two of the following three desired characteristics:

  • Consistency (C)

  • Availability (A)

  • Partition Tolerance (P)

Consistency

Consistency means that every client read request receives the most recent write. All clients see the same data no matter which node they connect to.

A weaker variant, eventual consistency, relaxes this guarantee: clients are guaranteed to see the same data on all nodes at some point in the future, but not necessarily immediately.

For strong consistency, whenever a write lands on one database node, the data must be replicated to all the other nodes before the write is considered complete. In other words, updates are immediately reflected on every node, so the data stays consistent.

A screenshot of Consistency
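
To make this concrete, here is a minimal Python sketch of strong consistency, assuming synchronous replication: a write is acknowledged only after every node has applied it, so any node serves the latest value. The Node class and node names are illustrative, not taken from any particular database.

```python
# Minimal sketch of strong consistency via synchronous replication.
# Assumption: a write "succeeds" only after every node applies it.

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value

    def read(self, key):
        return self.store.get(key)


def consistent_write(nodes, key, value):
    # Replicate to every node before acknowledging the write.
    for node in nodes:
        node.apply(key, value)
    return "ok"


nodes = [Node("n1"), Node("n2"), Node("n3")]
consistent_write(nodes, "balance", "100")
# Every client now reads the same, most recent value from any node.
assert all(n.read("balance") == "100" for n in nodes)
```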

Availability

Availability means that any client making a request always gets a valid response, even when some nodes are down or unavailable.

A screenshot of Availability

Even if a user's request arrives while some nodes are down (as in the diagram above), the system stays available and functioning. Every request receives a response, whether successful or not. This aspect is crucial to availability, as it guarantees that users always get feedback and are never left hanging.
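
Here is a minimal Python sketch of that behavior: the client fails over across replicas and always receives a response, successful or not. All class and node names are hypothetical.

```python
# Minimal sketch of availability: every request gets a response,
# even when some nodes are down.

class Replica:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.data = {"greeting": "hello"}

    def read(self, key):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        return self.data.get(key)


def available_read(replicas, key):
    for replica in replicas:
        try:
            return replica.read(key)   # first reachable node answers
        except ConnectionError:
            continue                   # fail over to the next node
    return "error: all replicas down"  # still a response, never a hang


replicas = [Replica("n1", alive=False), Replica("n2"), Replica("n3")]
print(available_read(replicas, "greeting"))  # "hello", despite n1 being down
```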

Partition Tolerance

A partition is a communication break that occurs when a connection between two or more nodes in a distributed system is lost or delayed.

Partition tolerance means that the cluster of nodes should continue functioning despite communication breakdowns between nodes.

A distributed system should guarantee partition tolerance and can gracefully recover from partitions once they heal.

A screenshot of Partition Tolerance

The diagram addresses network failure, which is a common cause of partitioning. However, it also shows that the system is designed to function even with network failures, which is the key characteristic of Partition Tolerance.
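
Here is a minimal Python sketch of one way to achieve this, loosely modeled on the "hinted handoff" idea mentioned later for Riak: writes that cannot reach a partitioned peer are queued and replayed once the link heals. Class and node names are hypothetical.

```python
# Minimal sketch of partition tolerance: writes that cannot reach a
# partitioned peer are queued and replayed once the link heals, so the
# node keeps functioning through the communication breakdown.

class Peer:
    def __init__(self, name):
        self.name = name
        self.reachable = True
        self.store = {}

    def receive(self, key, value):
        if not self.reachable:
            raise ConnectionError("link to " + self.name + " is broken")
        self.store[key] = value


class Node:
    def __init__(self, peers):
        self.peers = peers
        self.store = {}
        self.pending = []  # writes we could not deliver during the partition

    def write(self, key, value):
        self.store[key] = value  # keep serving despite the partition
        for peer in self.peers:
            try:
                peer.receive(key, value)
            except ConnectionError:
                self.pending.append((peer, key, value))

    def heal(self):
        # Replay queued writes once communication is restored.
        for peer, key, value in self.pending:
            peer.receive(key, value)
        self.pending.clear()


n2, n3 = Peer("n2"), Peer("n3")
n1 = Node([n2, n3])
n3.reachable = False   # a partition cuts n1 off from n3
n1.write("k", "v")     # n1 still accepts the write
n3.reachable = True    # the partition heals...
n1.heal()              # ...and the system gracefully recovers
assert n3.store["k"] == "v"
```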

Trade-Offs in the CAP Theorem

As we design our systems, we want all three characteristics of the CAP theorem. But since we can't have everything, we make a trade-off: we understand our system and choose the combination of two that best suits our system and customers.

These combinations are:

  • CA System (Consistency and Availability)

  • CP System (Consistency and Partition Tolerance)

  • AP System (Availability and Partition Tolerance)

CA System

In the CA system, we prioritize Consistency and Availability over Partition Tolerance. However, it's essential to understand that true CA systems are theoretical rather than practical within distributed systems because real-world distributed systems must contend with network partitions and failures.

Practical Implications:

Distributed systems cannot be strictly CA because they need to handle partitions (P). When a partition occurs, a system that claims to be CA would have to either:

  • Sacrifice Consistency: Allow some nodes to have outdated or inconsistent data to maintain availability.

  • Sacrifice Availability: Deny requests until the partition is resolved to maintain consistency.

Examples

  1. Traditional Relational Databases (Single Node): Traditional databases like SQL databases on a single server can be considered CA because they are consistent and available as long as there is no partition (since there is no network partition within a single node).

  2. Distributed CA-like Systems: Some systems can strive to be CA under the assumption of a reliable network where partitions are extremely rare or handled quickly. For example, systems within a well-maintained, high-speed local network might try to operate under CA assumptions.

CP Systems

A CP system prioritizes Consistency and Partition Tolerance over Availability. Here's a detailed explanation:

Key Characteristics of CP Systems:

  1. Consistency (C): Every read receives the most recent write or an error. Once a write is acknowledged, the system ensures that subsequent reads reflect that write. The data appears as if there is a single, up-to-date copy, even if there are multiple replicas.

  2. Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes. This means the system can handle network partitions and maintain its operation, though not necessarily its availability.

  3. Reduced Availability (A): To ensure consistency, some system parts may become unavailable during a network partition. If a network partition occurs, the system may deny requests to maintain consistency across all nodes, sacrificing availability.

Practical Implications:

In a CP system, if a network partition occurs, the system must choose to either:

  • Deny writes: To ensure that all replicas remain consistent, the system might reject write operations during the partition.

  • Deny reads: To ensure that no stale data is read, the system might reject read operations until it can ensure consistency.

  • Both deny reads and writes: Sometimes, the system may become partially or fully unavailable to ensure that all data remains consistent when the partition is resolved.
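
As a minimal sketch of the "deny writes" option, here is a simplified quorum write in Python: with three replicas, the write needs a majority of acknowledgments, and it is rejected when a partition leaves too few replicas reachable. Real CP systems use consensus protocols such as Raft or Paxos and would also undo the partial write; this sketch only illustrates the trade-off, and all names are hypothetical.

```python
# Minimal sketch of a CP-style "deny writes" decision: the write needs a
# majority (quorum) of replica acknowledgments; with too few reachable
# replicas it is rejected to protect consistency, sacrificing availability.

QUORUM = 2  # majority of 3 replicas

class Replica:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.store = {}

    def apply(self, key, value):
        if not self.reachable:
            raise ConnectionError(self.name + " unreachable")
        self.store[key] = value


def quorum_write(replicas, key, value):
    acks = 0
    for replica in replicas:
        try:
            replica.apply(key, value)
            acks += 1
        except ConnectionError:
            pass
    if acks < QUORUM:
        # Real systems would also roll back the partial write; the point
        # here is the refusal: no availability rather than inconsistency.
        raise RuntimeError("write rejected: only %d/%d acks" % (acks, len(replicas)))
    return "ok"


replicas = [Replica("n1"),
            Replica("n2", reachable=False),  # partitioned away
            Replica("n3", reachable=False)]  # partitioned away
try:
    quorum_write(replicas, "balance", "100")
except RuntimeError as err:
    print(err)  # write rejected: only 1/3 acks
```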

Examples of CP Systems:

  1. HBase: HBase is a distributed, scalable, big data store modeled after Google's BigTable. It prioritizes consistency and partition tolerance, making it a CP system. It may refuse to process some operations during a network partition to ensure data consistency.

  2. ZooKeeper: Apache ZooKeeper is a coordination service for distributed applications that provides mechanisms like configuration management, synchronization, and naming. It ensures strict consistency; during a network partition, some operations may become unavailable to maintain consistency.

  3. Etcd: Etcd is a distributed key-value store that guarantees strong consistency and is often used for configuration management and service discovery in systems like Kubernetes. It prioritizes consistency and partition tolerance, making it a CP system.

Trade-offs:

  • High Consistency: CP systems ensure that all nodes see the same data simultaneously, which is crucial for applications where stale or inconsistent data is unacceptable.

  • Partition Tolerance: CP systems can handle network failures and partitions, maintaining consistency by potentially sacrificing availability.

  • Reduced Availability: During network partitions, CP systems may become partially or fully unavailable to maintain consistency, impacting user experience and system responsiveness.

AP Systems

An AP system in the context of the CAP theorem is a distributed system that prioritizes Availability and Partition Tolerance over Consistency. Here's a detailed explanation:

Key Characteristics of AP Systems:

  1. Availability (A): Every request receives a (non-error) response, even if it does not contain the most recent write. This means the system is designed to be operational and always respond to read and write requests, regardless of the network's state.

  2. Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes. This ensures that the system can handle network partitions and continue to function.

  3. Eventual Consistency (C): AP systems provide eventual consistency rather than guaranteeing immediate consistency. While a read request may not always return the most recent write immediately, all updates will eventually propagate through the system, and all replicas will become consistent over time.
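
Here is a minimal Python sketch of this AP write path, assuming one local node and one background-replicated replica: the write is acknowledged immediately, and the replica catches up later, so reads from it can be briefly stale. The names are illustrative only.

```python
import queue

# Minimal sketch of an AP-style write path: the node acknowledges writes
# immediately from its local store and replicates in the background, so
# the replica can briefly serve stale data before it converges.

replication_log = queue.Queue()  # stands in for an async replication channel
local_store = {}
replica_store = {}

def ap_write(key, value):
    local_store[key] = value           # acknowledge right away: available
    replication_log.put((key, value))  # propagate later: eventually consistent
    return "ok"

def replicate_once():
    # A background worker would drain this continuously; we drain it once.
    while not replication_log.empty():
        key, value = replication_log.get()
        replica_store[key] = value

print(ap_write("post", "hello"))  # "ok" immediately
print(replica_store.get("post"))  # None: the replica is briefly stale
replicate_once()
print(replica_store.get("post"))  # "hello": replicas converge over time
```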

Practical Implications:

In an AP system, if a network partition occurs, the system remains available by:

  • Allowing reads and writes: All nodes can continue reading and writing even if some nodes cannot communicate with each other.

  • Handling conflicts: The system must handle potential data conflicts and reconcile them once the partition is resolved. This can be done using various strategies, such as versioning, conflict-free replicated data types (CRDTs), or application-specific conflict resolution mechanisms.
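
To illustrate one of those strategies, here is a minimal grow-only counter (G-Counter), one of the simplest CRDTs: each node increments only its own slot, and divergent copies merge with an element-wise max, so replicas converge without coordination. This is a textbook sketch, not any particular database's implementation.

```python
# Minimal sketch of a conflict-free replicated data type: a grow-only
# counter (G-Counter). Divergent copies can always be merged without
# coordination and without losing updates.

class GCounter:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes

    def increment(self):
        self.counts[self.node_id] += 1

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge no matter the merge order.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)


# Two replicas diverge during a partition...
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()  # 2 likes recorded on node 0
b.increment()                 # 1 like recorded on node 1

# ...then reconcile once the partition heals.
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```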

Examples of AP Systems:

  1. DynamoDB: Amazon DynamoDB is a key-value and document database with high availability and partition tolerance. It uses techniques like replication and partitioning to ensure that the database remains available and resilient to network partitions.

  2. Cassandra: Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers without a single point of failure. It provides high availability and partition tolerance, using eventual consistency to ensure all replicas converge over time.

  3. Riak: Riak is a distributed NoSQL database emphasizing availability and partition tolerance. It uses techniques like vector clocks and hinted handoff to ensure data remains available and eventually consistent.

Trade-offs:

  • High Availability: AP systems ensure that the system remains operational and responsive, even in the face of network partitions. This is critical for applications where downtime is unacceptable.

  • Partition Tolerance: AP systems can handle network failures and partitions, maintaining availability and ensuring the system continues functioning.

  • Eventual Consistency: To achieve high availability and partition tolerance, AP systems relax the requirement for immediate consistency. This means there can be a delay before all nodes reflect the most recent writes, which may lead to temporary inconsistencies.

Use Cases:

AP systems are suitable for applications where high availability is more important than immediate consistency, such as:

  • Social Networks: Where updates (e.g., posts, likes) can tolerate some delay before becoming consistent across all nodes.

  • E-commerce: Where product catalogs and user profiles need to be highly available, even if there are temporary inconsistencies.

  • Content Delivery Networks (CDNs): Where content must be served quickly and reliably, even if it takes some time for updates to propagate to all nodes.

Real World Example

Let’s say you have developed a system that has become quite popular. To effectively serve clients' requests, you scale horizontally by adding two database nodes (n2 and n3).

Ideal situation

Data written on n1 is replicated to the other nodes (n2 and n3). This achieves both data consistency and system availability, making it an example of a CA (Consistency and Availability) system.

A CA system cannot exist in real-world applications since network failure is unavoidable. Network partitioning generally has to be tolerated.

Normal situation

In the real world, network partitions cannot be avoided, and when one occurs, we must choose either consistency (CP) or availability (AP). When n3 goes down, it cannot communicate with n1 and n2, so data written to n1 and n2 cannot be propagated to n3. This means n3 will serve stale data when a user reads from it.

A CP (Consistency and Partition Tolerance) system, such as a bank's, will choose consistency over availability. To avoid data inconsistencies among the three nodes, it must block all write operations to n1 and n2, making them unavailable.

Bank systems have extremely high consistency requirements. For instance, a bank must display a client's latest bank balance to avoid overdrawing.

If data inconsistency occurs due to network partition, the system should be unavailable and return an error until data inconsistencies are resolved.

An AP (Availability and Partition Tolerance) system will keep accepting client reads even when the data is stale. Nodes n1 and n2 will keep accepting writes, and the data will be synced to n3 once the partition is resolved.

This architecture is popular in systems like social media platforms where users’ uninterrupted access and interaction with the system are prioritized. Social media users can continue browsing the content even if the data is stale.
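
A compact Python sketch of the whole scenario, with a hypothetical "mode" flag rather than any real database setting, makes the contrast explicit: when n3 is cut off, a CP system blocks the write while an AP system accepts it and syncs later.

```python
# Minimal sketch of the n1/n2/n3 scenario: n3 is partitioned away, and
# the write path differs by which CAP trade-off the system makes.

def handle_write(mode, reachable_nodes, total_nodes=3):
    if mode == "CP":
        # Block writes until every node can be kept consistent
        # (real CP systems typically require a quorum rather than all nodes).
        if reachable_nodes < total_nodes:
            return "error: write blocked until the partition is resolved"
        return "ok: written to all nodes"
    if mode == "AP":
        # Accept the write now; the cut-off node syncs after the partition heals.
        return "ok: written to %d nodes, n3 will sync later" % reachable_nodes

print(handle_write("CP", reachable_nodes=2))  # the bank blocks the write
print(handle_write("AP", reachable_nodes=2))  # the social network accepts it
```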

That will be all for this week. I like to keep this newsletter short.

Today, I discussed the CAP Theorem; you learned about Consistency, Availability, Partition Tolerance, and other trade-offs such as CP systems, CA systems, and AP systems.

This ends the System Design series. However, the System Design for Backend Engineers Course is launching soon.

Next week, I will start exploring Software Testing.

Don’t miss it, and share it with a friend.

Did you learn any new things from this newsletter this week? Please reply to this email and let me know. Feedback like this encourages me to keep going.

See you next week.

Remember to join the Advanced Backend Bootcamp 2.0, which will start soon. You can reply to this email with any questions or concerns.

Top 5 Remote Backend Jobs this week

Here are the top 5 Backend Jobs you can apply to now.

👨‍💻 Backbase
✍️ Backend Engineer
📍 Remote, Amsterdam, Netherlands
💰 Click on Apply for salary details
Click here to Apply for this role.

👨‍💻 Unity Technologies
✍️ Backend Engineer
📍 Remote, Tel Aviv-Yafo, Israel
💰 Click on Apply for salary details
Click here to Apply for this role.

👨‍💻 SFOX
✍️ Backend Software Engineer, Trading Infrastructure
📍 Remote
💰 Click on Apply for salary details
Click here to Apply for this role.

👨‍💻TransferGo
✍️ Backend Engineer - Cards & Integrations
📍 Remote, London, United Kingdom
💰 Click on Apply for salary details
Click here to Apply for this role.

Want more Remote Backend Jobs? Visit GetBackendJobs.com

Backend Engineering Resources

Whenever you're ready

There are 4 ways I can help you become a great backend engineer:

1. The MB Platform: Join 1000+ backend engineers learning backend engineering on the MB platform. Build real-world backend projects, track your learnings and set schedules, learn from expert-vetted courses and roadmaps, and solve backend engineering tasks, exercises, and challenges.

2. The MB Academy: The MB Academy is a 6-month intensive Advanced Backend Engineering Bootcamp designed to produce great backend engineers.

3. Backend Engineering Courses: Access a catalog of backend courses, from development fundamentals to advanced backend engineering, NestJS, and backend web development. Next-level backend engineering courses and exclusive resources.

4. GetBackendJobs: Access 1000+ tailored backend engineering jobs, manage and track all your job applications, create a job streak, and never miss applying. Lastly, you can hire backend engineers anywhere in the world.

LAST WORD 👋

How am I doing?

I love hearing from readers, and I'm always looking for feedback. How am I doing with The Backend Weekly? Is there anything you'd like to see more or less of? Which aspects of the newsletter do you enjoy the most?

Hit reply and say hello - I'd love to hear from you!

Stay awesome,
Solomon

I moved my newsletter from Substack to Beehiiv, and it's been an amazing journey. Start yours here.
