- Backend Weekly
- Posts
- Understanding Database Sharding
Understanding Database Sharding
I will explain Database Sharding, how it works, sharding architectures, benefits, and alternatives to database sharding
Hello “👋”
Welcome to another week, another opportunity to become a Great Backend Engineer.
Today’s issue is brought to you by Masteringbackend → A great resource for backend engineers. We offer next-level backend engineering training and exclusive resources.
Before we get down to the business of today. Part 11 of Understanding System Design.
I have a special request from you: You will love this one.
Finally, I’m starting my Advanced Backend Bootcamp 2.0. Since I launched the first one, I have been preparing for this with my team for over a year.
This bootcamp is different. It’s a 4-6 month in-depth backend engineering Bootcamp diving deep into backend engineering and building real-world backend projects.
Not to say much, because I’m fully prepared to turn you into a great backend engineer.
Click here for all the information you need.
Now, back to the business of today.
In the previous edition, I discussed Database Replication and explored how it works and the types, advantages, and disadvantages of database replication.
In this episode, I will explain Database Sharding, how it works, sharding architectures, benefits, and alternatives to database sharding.
Overview of Database Sharding
A database is a significant component of our system design. However, it can easily become a pain when our application receives significant growth and traffic. Eventually, we must scale the database server to accommodate the traffic increases.
Scaling for large applications or websites must ensure the security and integrity of data. Sometimes, it can be difficult to forecast how popular a system will become. That’s why some companies choose dynamic database architectures to scale automatically.
There are many such dynamic database architectures, and we have discussed the first one, Database Replication. Today, we will explore Database Sharding.
What is Database Sharding?
Database Sharding is one of the database architectural patterns related to Horizontal Partitioning. It’s a process of separating one table into multiple tables, known as partitions, using predefined sharding architectures.
Each partition has the same schema and columns. However, the data is unique and independent of other partitions.
Let’s explore the two types of database partitioning that can be implemented to shard your database.
Horizontal Partitioning
Vertical Partitioning
Horizontal Partitioning: This type of partitioning is most commonly used in database sharding, where table rows are separated into multiple tables in a database. The diagram below illustrates horizontal partitioning visually.
Vertical Partitioning: Vertical partitioning separates entire columns and puts them into new, distinct tables. The diagram below illustrates horizontal partitioning visually.
The explanation above shows that vertical partitioning focuses on table columns while horizontal partitioning focuses on table rows.
General Overview of Horizontal and Vertical Database Sharding
Sharding involves splitting data into two or more chunks. The shards are then distributed across multiple database nodes. The data held across all the shards cumulatively represent an entire dataset.
In most cases, sharding is implemented on the application level, where the application contains logic that defines where to read or write data.
In the diagram above, splitting the user tables according to region should significantly improve data reads/writes. If a request comes from Europe, the application logic should forward it to the Shard that contains European data instead of traversing it through a single database with billions of user records.
This significantly improves the system's performance, lowers latency, and increases throughput by removing load on a single database. It can further be improved by installing a database node around the users' region.
Now that I have clarified Database Sharding to some extent. Let’s explore its benefits and drawbacks.
Benefits of Database Sharding
Improved Performance and Scalability - Sharding enables horizontal scaling, allowing the database to handle larger amounts of data and higher workloads. By distributing data among multiple shards, each can handle a subset of the overall data and workload. This parallel processing capability improves the system’s overall performance and scalability.
Outage Management—Sharding provides better fault isolation and availability than a single monolithic database. If one shard or database system fails, the other shards remain operational, ensuring the system can continue functioning. Sharding also allows for distributed data replication and backup strategies, enhancing fault tolerance and availability.
Geographical Distribution and Localized Data: Sharding facilitates data distribution across geographical regions. This can benefit applications with a global user base, enabling data to be stored closer to the users, reducing latency, and improving the user experience. Sharding also complies with data sovereignty regulations, as data can be stored in specific regions based on user location.
Drawbacks of Database Sharding
Increased Complexity - Sharding involves distributing and coordinating data across multiple shards, which can be challenging to set up and maintain.
Data Skew and Hotspots (celebrity problem) - Some shards may experience higher loads or data skew, while others remain underutilized. Some shards will experience more traffic than others as users request the profiles of the celebrities. You must manage and rebalance data across the shards for an even workload distribution.
Schema Changes: Making schema changes or altering data structures across multiple shards can be more challenging than in a single database system.
Difficulty in unsharding databases: When your database is shared, and you want to return to the normal monolith database, it can become very difficult and require a high level of expertise and attention to detail.
Sharding is not natively supported by every database engine: In most databases, Sharding is not natively supported and requires lots of configuration or forked instances of the database to get it working. For example, Postgre does not support Sharding natively except for some forked copies of Postgre. However, Some specialized database technologies — like MySQL Cluster or certain database-as-a-service products like MongoDB Atlas — include auto-sharding as a feature.
These are only general benefits and drawbacks you must consider before choosing Database Sharding. There are a lot more benefits and drawbacks that I could cover here based on different use cases.
Now that I have covered some general benefits and drawbacks of Database Sharding. Let’s look at some Sharding Architecture.
Database Sharding Architecture
Once you decide to proceed with database sharding, the next step is to choose a sharding architecture and how to implement it. The requests must go to the right/intended database when distributing the workload to different shards. Below are some of the database sharding architectures from which to choose:
Range Based / Dynamic Sharding
Range-based sharding splits database rows of a table based on a range of values. To demonstrate, let’s look at the diagram below. We quickly saw that we split/shared our database based on the first letters of the user’s first name.
Range-based sharding
The main advantage of range-based sharding is its relatively simple implementation. However, it can result in data overloading on a single physical node. In our example, shard J-R
, which contains names that start from J-R, might contain much larger rows than others.
Key/Hashed Sharding
Hashed sharding uses a hashed function(mathematical formula) to assign a shard key to each database row. The hash function takes the information of the row and produces a hash value used as the shard key.
Key-based sharding
Hashed functions are best when you want to distribute data evenly among the physical shards. However, it does not separate the database according to meaningful information.
Directory-Based Sharding
Directory sharding uses a lookup table to match database information to a shard. A lookup table contains a static set of information about where specific data can be found.
Directory-based sharding
Each shard is a meaningful database representation and is not limited by ranges. However, directory sharding fails if the lookup table contains the wrong information.
Geo sharding
Geo-sharding splits and stores information according to the user's geographical location. For example, users’ information can be stored according to the user region, such as Africa, Europe, Asia, North America, etc.
Geo-based sharding
Geo sharding greatly improves the user experience, as it responds to user requests faster because of the distance between the user and server. However, it can also result in uneven data distribution.
Should I Shard My Databases?
When your application grows in size, there is a need to scale. Using a Sharded database architecture can solve your problems, but some see it as a headache that should be avoided unless necessary. Below are some of the scenarios where it may be beneficial to shard a database:
Enormous amount of data—When dealing with extremely large data, splitting it into multiple shards to remove performance bottlenecks is advisable.
Isolation of Different Customer Data - In multi-tenant applications, data from different organizations must be isolated. For example, in applications such as Office 365, sharding the databases is an effective method. Each shard can be dedicated to a specific organization, providing data separation, privacy, and security.
Geographic Distribution - an application or service with a global user base. Sharding can be useful for distributing data closer to users in different regions. Each shard can be located in a specific geographical region, reducing data access latency and improving the user experience.
Alternative to Database Sharding
Before you consider sharing your database, remember it’s difficult to unshard a sharded database. You should explore other options for optimizing your database. I will outline some of these options below:
Setting up a Remote Database: This could be a good option if you’re working with a monolithic application where all your infrastructures, including your database, reside on the same server. You can improve your system's performance by easily abstracting the database to a remote server and scaling it from there.
Implementing Caching: This is application-level optimization. If you noticed slow read performance in your application. Your next thought would be to add a distributed caching server. This will significantly improve your database Read performance instead of sharding.
Use Database Replication: Creating one or more read replicas of your database could significantly improve your performance. This involves copying the data from one database server (master/primary server) to one or more servers (slave/secondary). Here’s a detailed guide on Database Replication.
Horizontal/Vertical Scaling: In most cases, adding more resources to your server, whether horizontal or vertical scaling, will do the trick. This method is a better option than using Database Sharding.
However, if your application grows past a point where none of these strategies works, consider implementing a good Database Sharding strategy.
That will be all for this week. I like to keep this newsletter short.
Today, I discussed Database Sharding, how it works, and the types, advantages, and disadvantages of database replication.
Next week, I will start exploring Database Indexing.
Don’t miss it. Share with a friend
Did you learn any new things from this newsletter this week? Please reply to this email and let me know. Feedback like this encourages me to keep going.
See you on Next Week.
Remember to get the Masteringbackend → A great resource for backend engineers. We offer next-level backend engineering training and exclusive resources.
Top 5 Remote Backend Jobs this week
Here are the top 5 Backend Jobs you can apply to now.
👨‍💻 Webscope.io
✍️ Back-end Developer (.NET, Azure)
đź“ŤRemote, On-site, Brno
đź’° Click on Apply for salary details
Click here to Apply for this role.
👨‍💻 Docplanner
✍️ Backend developer
đź“ŤRemote, Italy
đź’° Click on Apply for salary details
Click here to Apply for this role.
👨‍💻 GitLab
✍️ Backend Engineer, Govern: Threat Insights
đź“ŤRemote
đź’° Click on Apply for salary details
Click here to Apply for this role.
👨‍💻 Robots & Pencils
✍️ Backend Developer (Node.js)
đź“ŤRemote, Toronto, Canada
đź’° Click on Apply for salary details
Click here to Apply for this role.
Want more Remote Backend Jobs? Visit GetBackendJobs.com
Backend Engineering Resources
Whenever you're ready
There are 4 ways I can help you become a great backend engineer:
1. The MB Platform: Join 1000+ backend engineers learning backend engineering on the MB platform. Build real-world backend projects, track your learnings and set schedules, learn from expert-vetted courses and roadmaps, and solve backend engineering tasks, exercises, and challenges.
2. ​The MB Academy:​ The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.
3. MB Video-Based Courses: Join 1000+ backend engineers who learn from our meticulously crafted courses designed to empower you with the knowledge and skills you need to excel in backend development.
4. GetBackendJobs: Access 1000+ tailored backend engineering jobs, manage and track all your job applications, create a job streak, and never miss applying. Lastly, you can hire backend engineers anywhere in the world.
LAST WORD đź‘‹
How am I doing?
I love hearing from readers, and I'm always looking for feedback. How am I doing with The Backend Weekly? Is there anything you'd like to see more or less of? Which aspects of the newsletter do you enjoy the most?
Hit reply and say hello - I'd love to hear from you!
Stay awesome,
Solomon
I moved my newsletter from Substack to Beehiiv, and it's been an amazing journey. Start yours here.
Reply