Understanding System Design - Part 2

In this episode, I explain the 3 main system design performance metrics.

Hello “👋”

Welcome to another week, another opportunity to become a Great Backend Engineer.

Today’s issue is brought to you by MasteringBackend, a great resource for backend engineers. We offer next-level backend engineering training and exclusive resources.

Before we get down to today's business, Part 2 of Understanding System Design, I have a special announcement. You will love this one.

Masteringbackend is launching the beta version of its platform, which allows you to learn backend engineering in one place. The platform will help you grow your backend engineering career and turn you into a great backend engineer.

Here are some of the features:

  1. Roadmaps => MB Roadmap enables a structured learning approach for backend engineers.

  2. Project Land => MB Projects enables backend engineers to use a learn-by-building model. Build real-world backend projects without coding the frontend.

  3. Backend Portfolio => Create and manage your backend portfolio with many real-world backend projects.

  4. BackLand => Learn backend engineering by solving challenges in a gamified way.

Sound interesting?

The beta version is out for testing, reviews, and feedback.

Reply to this email if you find anything worth reporting or if there is more feedback to help us improve.

Now, back to the business of today.

In the previous edition, I discussed the fundamentals of System Design. In this issue, I will continue from where we stopped on understanding all the system design performance metrics.

As elaborated on in the first episode of this series on System Design, there are 3 main metrics used to measure the performance of a system. Viz:

  1. System Reliability

  2. System Availability

    1. Availability Vs. Reliability

  3. System Efficiency

The first one is System Reliability, which was already covered in the previous episode. Check it out here if you haven’t read that one yet.

Now, let’s start with System Availability:

System Availability

System availability is the probability that a system works properly when it’s requested for use. It is usually expressed as the percentage of scheduled uptime during which the system is actually usable, that is, not down because of failures or other unscheduled interruptions. In other words, it measures whether the system has not failed and is not undergoing repair when it needs to be used.

Availability Calculation

Let’s say that your system runs for 24 hours a day. The system had a one-hour unplanned downtime because of a breakdown. The system availability can be calculated as follows:

availability % = (available time / total time) * 100
availability % = (23 hours / 24 hours) * 100 = 95.83%

The system availability was 95.83%. This might seem like a high score, but in software, 95.83% availability is not good.
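The same calculation can be sketched in a few lines of Python. The function name availability_percent is my own, chosen for illustration only.

```python
def availability_percent(available_hours: float, total_hours: float) -> float:
    """Return availability as a percentage of the total scheduled time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return available_hours / total_hours * 100


# The example from the text: 24 scheduled hours, 1 hour of unplanned downtime.
total = 24
downtime = 1
print(f"{availability_percent(total - downtime, total):.2f}%")  # 95.83%
```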

A system availability of 90% equates to 36.5 days of downtime per year. Even with an availability of 95%, an online marketplace like Amazon could lose billions of dollars annually.

Cloud computing providers like Azure, AWS, and Google Cloud publish Service Level Agreements (SLAs) that commit them to specific reliability and availability levels, defining the standards that keep your systems running smoothly despite disruptions.

“Five nines” means that your system is available 99.999% of the time, which allows only about 5 minutes of downtime per year. It is a target that often appears in SLAs between companies.
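To get a feel for what each extra nine buys you, here is a small sketch that converts an availability percentage into the downtime it allows per year. The function name is hypothetical, and I assume a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed at a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for level in (90.0, 95.0, 99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(level)
    print(f"{level}% -> {minutes:,.2f} minutes/year ({minutes / 1440:.2f} days)")
```

At 90% this works out to 36.5 days per year, and at 99.999% to roughly 5.26 minutes per year.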

Measuring system availability in a microservices architecture can be challenging since some components might be less available than others. The impact of a single component failing can be reduced by running redundant (backup) or replicated servers.
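Here is a rough sketch of how component availabilities combine. It assumes failures are independent, which real systems rarely guarantee, and all the figures are hypothetical. Services that depend on each other in a chain multiply their availabilities, while replicas only fail together.

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """All components must be up: multiply their availabilities."""
    return prod(components)


def parallel_availability(replicas: list[float]) -> float:
    """At least one replica must be up: 1 minus the chance all are down."""
    return 1 - prod(1 - a for a in replicas)


# Hypothetical chain of three services, each 99.9% available.
chain = serial_availability([0.999, 0.999, 0.999])

# Hypothetical 99%-available service running as two replicas.
replicated = parallel_availability([0.99, 0.99])

print(f"chain of three: {chain:.4%}")
print(f"two replicas:   {replicated:.4%}")
```

Under these assumptions, the chain is less available than any single service in it (about 99.7%), while two replicas of a 99% service reach 99.99%.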

A load balancer can detect when a server fails and route traffic to a backup server, increasing availability.
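The failover idea can be sketched as a toy round-robin load balancer that skips unhealthy servers. The Server and LoadBalancer classes are made up for illustration, and the healthy flag stands in for a real health check. Production load balancers do far more than this.

```python
from itertools import cycle


class Server:
    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy  # stand-in for a real health check


class LoadBalancer:
    """Round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers: list[Server]) -> None:
        self._servers = servers
        self._ring = cycle(servers)

    def pick(self) -> Server:
        # Try each server at most once per call.
        for _ in range(len(self._servers)):
            server = next(self._ring)
            if server.healthy:
                return server
        raise RuntimeError("no healthy servers available")


primary, backup = Server("primary"), Server("backup")
lb = LoadBalancer([primary, backup])

primary.healthy = False  # simulate a failure
print(lb.pick().name)    # backup
```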

Availability Vs. Reliability

What is the difference between reliability and availability?

Availability measures the percentage of time the system is in an operable state, while reliability measures how long an item performs its intended function without breaking down.

Reliability and availability go hand in hand: an increase in reliability usually translates to an increase in availability. Even so, it’s important to remember that the two metrics can produce different results.

You might have a highly available machine that is not reliable.

For example, consider a commercial blender that operates close to its maximum capacity. The motor can run for several hours daily, implying high availability.

However, it may need to stop and cool down every half an hour to avoid operational problems. Despite its high availability, the blender is not a highly reliable piece of equipment.
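One common way to put numbers on this difference is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The sketch below uses that formula with made-up figures. The one-minute cool-down and the second machine are my assumptions, not measurements.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)


# Hypothetical blender: stops every 30 minutes (low reliability),
# but each cool-down is assumed to take only 1 minute.
blender = availability(mtbf=30, mttr=1)

# Hypothetical machine that fails only every 1,000 hours (high reliability)
# but is assumed to need 100 hours of repair each time.
slow_to_repair = availability(mtbf=1000, mttr=100)

print(f"blender:        {blender:.1%}")
print(f"slow to repair: {slow_to_repair:.1%}")
```

With these figures, the unreliable blender ends up more available than the machine that rarely fails, because it recovers so quickly.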

Best Practices to Improve System Availability and Reliability

The goal of high availability is to minimize system downtime and the time needed to recover from an outage. This can be achieved by:

  1. Build with failure in mind - Always plan for your application and services to fail. As the CTO of Amazon, Werner Vogels, says, “Everything fails all the time.” Design constructs such as simple try-catch blocks, retry logic, and circuit breakers let you catch errors and limit the scope of a problem, so your app continues working even if parts of it fail. The circuit breaker pattern is especially useful for dependency failures, since it can greatly reduce their impact on your system.

  2. Always think about scaling - An application that generates a certain amount of traffic today might generate much more traffic sooner than anticipated. As you build your app, don’t build it for today’s traffic but for tomorrow’s. This can be achieved by building an application that lets you add servers and increase your databases' size and capacity easily when needed.

  3. Reduce single points of failure - Eliminate single points of failure from your application infrastructure. Since all hardware fails at some point, limit the impact a failure can have on your application. This means having a backup for everything you anticipate could fail: servers, routers, switches, power sources, etc.

  4. Monitor the application - Make sure your application is instrumented so you can see how it performs. Monitoring tools track the health of servers and the performance of applications and services, run synthetic tests (which examine in real time how the app is working from the user's perspective), and alert the appropriate personnel when problems occur so that they can be resolved quickly.

  5. Respond to downtime predictably - Monitoring is useless unless you are prepared to act on what it reports. Establish processes your team follows to diagnose and fix common failure scenarios. These processes should be in place before an outage, so that when one happens the owner of the affected service is alerted and can restore it quickly.
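To make the first practice more concrete, here is a minimal sketch of retry logic combined with a circuit breaker. All names (CircuitBreaker, retry, flaky) and the thresholds are my own assumptions for illustration. Treat this as a starting point rather than a production implementation.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected."""


class CircuitBreaker:
    """Stops calling a dependency after repeated consecutive failures."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds before a trial call is allowed
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is failing; call rejected")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result


def retry(func: Callable[[], T], attempts: int = 3, base_delay: float = 0.1) -> T:
    """Retry func with exponential backoff; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # no point retrying while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def flaky() -> str:
    # Hypothetical dependency that is currently down.
    raise ConnectionError("payment service unreachable")


breaker = CircuitBreaker(max_failures=2, reset_after=60)
try:
    retry(lambda: breaker.call(flaky), attempts=5, base_delay=0.01)
except CircuitOpenError:
    print("circuit open, using fallback")
```

After two failed attempts the breaker opens, so the third attempt is rejected immediately instead of hitting the failing dependency again. That is what limits the scope of the problem.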

System Efficiency

System efficiency measures how well a system works. The two metrics used to measure system efficiency are Latency and Throughput.

Throughput

Throughput refers to how much data a system can process within a specific period.

It measures the quantity of data sent or received per unit of time. For data transfer, it is commonly expressed in bits per second, such as megabits per second (Mb/s).

For example, a system might process 1 TB of data per hour.

In a client-server system, client throughput is the number of responses a client receives per unit of time for the requests it makes, while server throughput is the number of requests a server can process per unit of time (usually per second).
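As a rough illustration, server throughput can be estimated by counting how many requests a handler completes in a fixed time window. The handle_request function below is a hypothetical stand-in for real server work, and the number you get depends entirely on your machine.

```python
import time
from typing import Callable


def measure_throughput(handler: Callable[[], None], duration: float = 1.0) -> float:
    """Call handler repeatedly for `duration` seconds; return calls per second."""
    completed = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        handler()
        completed += 1
    elapsed = time.perf_counter() - start
    return completed / elapsed


def handle_request() -> None:
    # Hypothetical request handler: stands in for real server work.
    sum(range(1_000))


print(f"{measure_throughput(handle_request, duration=0.5):,.0f} requests/second")
```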

Latency

Latency is a measure of delay. It is usually measured in milliseconds (ms).

In a client-server system, there are two types of latency:

  1. Network latency - The time it takes for data packets to travel from a client to the server. It can be measured one way or as a round trip.

  2. Server latency - The time the server takes to process a request and generate a response.
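The two types add up to the delay the client actually sees. Here is a small sketch that times a hypothetical server function and adds an assumed network round trip. The 40 ms one-way figure and the 20 ms of simulated work are placeholders, not measurements.

```python
import time


def process_request() -> str:
    # Hypothetical server work; the sleep stands in for real processing.
    time.sleep(0.02)
    return "ok"


def timed_ms(func):
    """Run func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func()
    return result, (time.perf_counter() - start) * 1000


SIMULATED_ONE_WAY_MS = 40.0  # assumed one-way network latency, not measured

_, server_ms = timed_ms(process_request)
round_trip_ms = 2 * SIMULATED_ONE_WAY_MS
total_ms = round_trip_ms + server_ms

print(f"network (round trip): {round_trip_ms:.1f} ms")
print(f"server: {server_ms:.1f} ms")
print(f"total seen by the client: {total_ms:.1f} ms")
```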

Why are latency and throughput important?

If the latency is high, there is a long delay before a response arrives. If the throughput is low, the number of requests processed is low.

High latency and low throughput impair a system's performance. In some systems, such as games, latency matters a lot. If the latency is high, a user will experience lag, drastically impairing the user experience.

Using an in-memory cache can improve server latency and throughput when making database queries. The following is an example of a latency test.
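To see the caching effect for yourself, here is a minimal sketch that times the same lookup with and without a cache, using functools.lru_cache from Python's standard library. The query_database function and its 50 ms delay are hypothetical stand-ins for a real database call.

```python
import time
from functools import lru_cache


def query_database(user_id: int) -> dict:
    # Hypothetical slow query; the sleep stands in for disk and network time.
    time.sleep(0.05)
    return {"id": user_id, "name": f"user-{user_id}"}


@lru_cache(maxsize=1024)
def get_user(user_id: int) -> dict:
    """Same lookup, but repeated calls are answered from memory."""
    return query_database(user_id)


def elapsed_ms(func, *args) -> float:
    start = time.perf_counter()
    func(*args)
    return (time.perf_counter() - start) * 1000


first = elapsed_ms(get_user, 42)   # cache miss: goes to the "database"
second = elapsed_ms(get_user, 42)  # cache hit: served from memory
print(f"miss: {first:.2f} ms, hit: {second:.4f} ms")
```

The trade-off is that cached data can go stale, so a real system also needs a way to expire or invalidate entries.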

Latency Tests

Latency tests carried out across key data storage and transfer paths, such as in-memory cache, SSD, HDD, and network calls, reveal the following:

  1. Reading 1MB sequentially from cache memory takes 250 microseconds.

  2. Reading 1MB sequentially from an SSD takes 1,000 microseconds or 1 millisecond.

  3. Reading 1MB sequentially from disk (HDDs) takes 20,000 microseconds or 20 milliseconds.

  4. Sending a packet of data from California to the Netherlands and back to California over the network takes 150,000 microseconds or 150 milliseconds.

1000 nanoseconds = 1 microsecond

1000 microseconds = 1 millisecond

1000 milliseconds = 1 second

Therefore, reading from an in-memory cache is 80 times faster than reading from an HDD!
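The comparison is simple arithmetic on the figures listed above, which a few lines of Python can reproduce. The dictionary keys are just labels I chose.

```python
# Figures from the latency tests listed above, in microseconds.
LATENCY_US = {
    "in-memory cache": 250,
    "SSD": 1_000,
    "HDD": 20_000,
    "network round trip (California <-> Netherlands)": 150_000,
}

baseline = LATENCY_US["in-memory cache"]
for name, micros in LATENCY_US.items():
    print(f"{name}: {micros / 1000:g} ms, {micros / baseline:g}x the cache read")
```

Relative to the cache read, the SSD read takes 4 times as long, the HDD read 80 times, and the network round trip 600 times.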

That’s all

Today, I discussed the two remaining system performance metrics, System Availability and System Efficiency, and showed you examples of each. Together with the previous episode, which introduced the basics of system design and covered System Reliability, this completes the 3 main metrics.

Next week, I will cover a very important topic in System Design: Proxies. We will discuss forward and reverse proxies.

DON’T LEARN ALONE. SHARE THIS NEWSLETTER WITH YOUR FRIENDS

I hope this guide gives you perspective on System Design Performance Metrics.

That will be all for this one. See you on Saturday.

Don’t forget to sign up for the beta version of Masteringbackend. It comes with unmatched benefits.

Backend Engineering Resources

Whenever you're ready

There are 4 ways I can help you become a great backend engineer:

1. The MB Platform: Join 1000+ backend engineers learning backend engineering on the MB platform. Build real-world backend projects, track your learnings and set schedules, learn from expert-vetted courses and roadmaps, and solve backend engineering tasks, exercises, and challenges.

2. The MB Academy:​ The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

3. MB Video-Based Courses: Join 1000+ backend engineers who learn from our meticulously crafted courses designed to empower you with the knowledge and skills you need to excel in backend development.

4. GetBackendJobs: Access 1000+ tailored backend engineering jobs, manage and track all your job applications, create a job streak, and never miss applying. Lastly, you can hire backend engineers anywhere in the world.

LAST WORD 👋 

How am I doing?

I love hearing from readers, and I'm always looking for feedback. How am I doing with The Backend Weekly? Is there anything you'd like to see more or less of? Which aspects of the newsletter do you enjoy the most?

Hit reply and say hello - I'd love to hear from you!

Stay awesome,
Solomon

I moved my newsletter from Substack to Beehiiv, and it's been an amazing journey. Start yours here.
