LUC #47: Database Sharding Explained — Strategies for Scalable Database Management
Plus: API design best practices, what Kafka is and how it works, and MVC vs MVP architecture
This week’s issue brings you:
Database Sharding Explained: Strategies for Scalable Database Management
API Design Best Practices (Recap)
READ TIME: 5 MINUTES
A big thank you to our partner Postman who keeps this newsletter free to the reader.
POST/CON 24 is for anyone who works with APIs or whose business relies on APIs. Check it out.
Database Sharding Explained: Strategies for Scalable Database Management
What’s one of the most common bottlenecks for an application?
You guessed it—the database.
To keep our systems performant and scalable in the face of enormous volumes of data and the need for rapid processing, implementing database scaling solutions is crucial.
There are several database scaling solutions. One of the most powerful but also most complex is database sharding.
Today, we’ll explore database sharding, when and where to use it, and best practices. Let’s dive in!
Understanding Database Sharding
Database sharding divides a database into smaller, more manageable segments known as "shards," which are distributed across various servers.
This approach differs from conventional scaling strategies like replication, which makes duplicate copies of data across several servers, and vertical scaling, which entails boosting the capacity of an already-existing server.
The main benefit of sharding is its capacity to distribute data throughout a network of computers, greatly enhancing scalability and performance.
There are two approaches to sharding: horizontal sharding and vertical sharding.
Horizontal Sharding
Horizontal sharding, also known as data partitioning, splits a database by row.
Each shard holds the same schema but contains a different subset of the data.
This is done by applying a consistent sharding key or algorithm to distribute rows across multiple databases or servers.
For example, user data could be sharded based on geographic location or user IDs, so that all data related to a particular region or range of user IDs is stored together.
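As a minimal sketch of how a sharding key maps rows to shards, the two functions below route a user ID by hash and by range. The shard count, the range boundaries, and the function names are assumptions made for illustration, not part of any particular database.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration


def shard_by_hash(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based routing: a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range-based routing: each shard owns a contiguous range of user IDs.
# Each value is the exclusive upper bound of a shard; the numbers are made up.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]


def shard_by_range(user_id: int) -> int:
    """IDs below the first bound go to shard 0; IDs past the last bound go to shard 3."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, user_id)
```

Range-based routing keeps neighbouring IDs together, which helps range scans but can create hot shards. Hash-based routing spreads keys more evenly at the cost of scattering ranges across shards.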
Vertical Sharding
Vertical sharding divides a database by table. Different tables, or groups of related tables, are placed in separate databases, so each shard holds a subset of the tables rather than a subset of the rows.
Vertical sharding is particularly useful when certain tables grow significantly larger or are accessed more frequently than others. By isolating heavily accessed tables, vertical sharding can reduce the load on a single database server and improve performance for specific queries.
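The table-based split described above can be sketched as a simple lookup from table name to database. The table names and database names here are hypothetical.

```python
# Hypothetical mapping of tables to the database (shard) that holds them.
TABLE_TO_SHARD = {
    "users": "db-accounts",
    "profiles": "db-accounts",
    "orders": "db-orders",
    "order_items": "db-orders",
    "events": "db-analytics",  # large, write-heavy table isolated on its own server
}


def shard_for_table(table: str) -> str:
    """Return the database that stores the given table."""
    try:
        return TABLE_TO_SHARD[table]
    except KeyError:
        raise ValueError(f"no shard configured for table {table!r}") from None


def same_shard(*tables: str) -> bool:
    """True if all tables live in one database, so they can be joined locally."""
    return len({shard_for_table(t) for t in tables}) == 1
```

Keeping tables that are frequently joined (such as orders and order_items in this sketch) on the same shard avoids cross-database joins.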
Both sharding techniques facilitate horizontal scaling, i.e., adding more machines to a system to spread the load. They differ in how they do it.
Vertical sharding is table-centric, making it well-suited for databases where specific tables are disproportionately larger or heavily queried.
Horizontal sharding, on the other hand, is data-centric, which makes it better suited for evenly distributing a large volume of similar data across several servers.
Database Scaling Techniques That Should Be Exhausted Before Sharding
Before you start sharding your database, an important principle should be kept in mind.
You shouldn’t implement premature optimizations or attempt to scale your app before it’s actually needed. Implementing scaling solutions introduces complexities such as:
Adding new features takes longer
The system becomes more complex with more pieces and variables involved
Code can be more difficult to test
Finding and resolving bugs becomes harder
You should only accept these trade-offs if your app is at capacity. Keep the system simple, and don’t introduce scaling complexity unless it’s warranted.
Database sharding is complex. Several more straightforward solutions might address performance issues effectively:
Vertical scaling
Adding resources to your existing server may provide a short-term improvement in performance, but it has cost and scalability limits.
Database and query optimization
Significant speed improvements can be achieved here, with minimal complexity.
Connection pooling
Application speed can be enhanced and overhead can be significantly decreased by managing database connections more effectively.
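To illustrate the idea, here is a minimal connection pool built on Python's standard library. It uses sqlite3 only so the sketch runs anywhere; the class name and its parameters are assumptions, and a production system would normally rely on the pooling provided by its database driver or a dedicated pooler.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Opens a fixed number of connections up front and reuses them."""

    def __init__(self, size: int, database: str = ":memory:"):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a connection be handed to other threads.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout: float = 5.0):
        # Blocks until a connection is free; raises queue.Empty after the timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection instead of closing it

    def idle_count(self) -> int:
        return self._idle.qsize()


pool = ConnectionPool(size=2)
with pool.connection() as conn:
    value = conn.execute("SELECT 1 + 1").fetchone()[0]
```

The saving comes from skipping connection setup on every request, and the fixed pool size also caps how many connections the database has to serve at once.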
Read replicas
Enabling read replicas can assist in offloading read activities from the primary database, improving read performance.
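A rough sketch of read/write splitting: statements that start with SELECT are spread across replicas in round-robin order, and everything else goes to the primary. The class and the server names are hypothetical, and the SELECT check is deliberately naive.

```python
import itertools


class ReplicaRouter:
    """Chooses which server should run a SQL statement."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        # Cycle through replicas in round-robin order; None if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql: str) -> str:
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary  # writes, and reads when no replica exists


router = ReplicaRouter("primary-db", ["replica-1", "replica-2"])
```

Replicas usually lag slightly behind the primary, so reads that must see a write that just happened are often sent to the primary as well.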
Caching
A powerful yet simple solution. Utilizing caching to store frequently used data can significantly reduce database load.
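One common way to apply this is the cache-aside pattern: check the cache first, and only query the database on a miss. The sketch below uses an in-process dictionary with a time-to-live; the class name and the loader callback are assumptions, and real deployments often use a shared cache instead.

```python
import time


class TTLCache:
    """Cache-aside helper: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit: the database is not touched
        value = loader(key)      # cache miss: fall back to the database
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key) -> None:
        """Call after a write so the next read reloads fresh data."""
        self._entries.pop(key, None)
```

Here loader stands in for whatever function actually queries the database.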
Database partitioning
Splitting large tables into smaller, more manageable pieces inside the same database can enhance data management without the need for sharding.
Why Shard a Database?
Sharding becomes relevant when applications hit scalability ceilings and performance bottlenecks that simpler approaches can't mitigate.
By distributing data across multiple servers, it reduces the load on any single server, enhances response times, and offers a scalable architecture that grows with the application.
Challenges and Considerations
Despite its advantages, database sharding poses several issues and complications.
These include choosing the right sharding key, handling cross-shard transactions, and maintaining data consistency between shards.
Careful preparation and implementation are essential for effectively navigating these tricky obstacles.
Best Practices for Database Sharding
Sharding key selection
To prevent unequal load distribution, use a sharding key that distributes data uniformly among shards.
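One way to evaluate a candidate sharding key is to measure how unevenly it spreads sample data. The sketch below compares a high-cardinality key (user IDs) with a low-cardinality key (country codes). The sample data, the shard count, and the skew metric are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def shard_of(key: str, num_shards: int) -> int:
    """Stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def skew(keys: list[str], num_shards: int) -> float:
    """Rows on the busiest shard divided by the ideal even share (1.0 is perfect)."""
    counts = Counter(shard_of(k, num_shards) for k in keys)
    return max(counts.values()) / (len(keys) / num_shards)


# Made-up sample: 10,000 rows keyed two different ways.
user_keys = [f"user-{i}" for i in range(10_000)]
country_keys = ["US"] * 9_000 + ["DE"] * 500 + ["JP"] * 500

user_skew = skew(user_keys, 4)        # expected to be close to 1.0
country_skew = skew(country_keys, 4)  # at least 3.6, since one shard gets every "US" row
```

A key with few distinct values, or with one dominant value, concentrates load on a single shard no matter how good the hash function is.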
Consistent hashing
Use consistent hashing to assign data to shards, so that adding or removing a shard only moves a small fraction of the keys instead of reshuffling most of the data.
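A minimal consistent hash ring might look like the sketch below. Each shard is placed on the ring at several points (virtual nodes), and a key belongs to the first point at or after its own hash, wrapping around at the end. The class name, shard names, and the choice of 100 virtual nodes are assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, shards, vnodes: int = 100):
        self._vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> shard name
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard: str) -> None:
        for i in range(self._vnodes):
            point = _hash(f"{shard}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = shard

    def remove_shard(self, shard: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != shard]
        self._owner = {p: s for p, s in self._owner.items() if s != shard}

    def shard_for(self, key: str) -> str:
        if not self._points:
            raise ValueError("ring has no shards")
        # First ring position clockwise from the key's hash, wrapping to the start.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["shard-a", "shard-b", "shard-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.shard_for(k) for k in keys}
ring.add_shard("shard-d")
moved = [k for k in keys if ring.shard_for(k) != before[k]]
```

In this sketch, every key in moved now belongs to shard-d, and keys never shuffle between the existing shards. With plain modulo hashing, going from three to four shards would typically relocate most keys.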
Monitoring and automation
Implement monitoring to track shard performance, and use automation for shard maintenance and data rebalancing.
Limit cross-shard transactions
As cross-shard transactions can impede performance and complicate processes, it is best to design the application to minimize them.
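One design lever is to shard related rows by the same key so that a typical transaction stays on one shard. The sketch below counts how many shards one customer's orders land on under two key choices. The order data and the shard count are made up for illustration.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count


def shard_of(key: str, num_shards: int = NUM_SHARDS) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical orders as (order_id, customer_id) pairs, all for one customer.
orders = [(f"order-{n}", "customer-7") for n in range(20)]

# Sharding by order ID scatters this customer's orders across shards.
by_order_id = {shard_of(order_id) for order_id, _ in orders}

# Sharding by customer ID keeps all of them on a single shard.
by_customer_id = {shard_of(customer_id) for _, customer_id in orders}
```

With the customer ID as the sharding key, updating several of this customer's orders in one transaction touches a single shard. With the order ID as the key, the same operation would need coordination across shards.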
Shard proximity
To improve access times for applications that are sensitive to latency, take into account the physical location of shards.
Extensive testing
Test the sharding technique extensively under real-world conditions to detect and resolve any concerns.
Future growth planning
Allow for expected changes in data volume and access patterns while designing the sharding system with potential expansion in mind.
Wrapping Up
Database sharding provides a scalable framework for applications that need to grow beyond a centralized system.
While it is very powerful, it’s also one of the most complex database scaling solutions. Therefore, more straightforward scaling solutions should be exhausted prior to implementing sharding. If sharding is taken on, it should be carefully planned and implemented to navigate the tricky challenges it comes with.
API Design Best Practices (Recap)
There are several aspects, techniques, and best practices in API design.
Idempotency, security, versioning, clear resource naming, use of plurals, cross-referencing resources, sorting, and filtering are all aspects that can be observed in the URL.
However, best practices go far beyond what can be observed in API URLs.
Thorough documentation, robust monitoring and logging, consistent error handling, and rate limiting are some of the other primary best practices that should be implemented to design effective and safe APIs.
MVC vs MVP Architecture (Recap)
MVC (Model View Controller) and MVP (Model View Presenter) are design patterns that separate an application into distinct components.
MVP was introduced after MVC to address some of MVC's drawbacks and improve maintainability.
Both patterns aim to separate concerns, but they have some differences in their approach.
The most significant differences relate to which component handles the business logic and how the UI gets updated.
MVC is typically less complex, but its components can be tightly coupled. MVP is generally more complex, but its components are more decoupled.
What is Kafka, and How Does it Work? (Recap)
Kafka is an open-source distributed streaming platform designed for building real-time data pipelines and streaming applications.
Kafka operates as a distributed pub-sub messaging system, allowing applications to publish and subscribe to real-time or near-real-time data feeds.
The high throughput, scalability, fault-tolerance, durability, and ecosystem Kafka provides have made it a very popular choice for use cases where real-time data feeds are required.
The key components of Kafka include Producer, Consumer, Broker, Topic, and Partition.
Kafka has many use cases, from aggregating data from different sources to monitoring and real-time analytics.
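To make the topic and partition concepts above concrete, here is a toy in-memory model. It is not the Kafka client API: the class, its methods, and the CRC32-based partitioning are stand-ins chosen for illustration, and real Kafka clients use their own partitioner.

```python
import zlib


class ToyTopic:
    """A topic is a set of append-only logs (partitions)."""

    def __init__(self, name: str, num_partitions: int):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> tuple[int, int]:
        """Append a record; records with the same key go to the same partition."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1  # (partition, offset of the new record)

    def consume(self, partition: int, offset: int, max_records: int = 10):
        """Read records from a partition, starting at the consumer's offset."""
        return self.partitions[partition][offset:offset + max_records]


topic = ToyTopic("page-views", num_partitions=3)
p1, first_offset = topic.produce("user-1", "viewed /home")
p2, second_offset = topic.produce("user-1", "viewed /pricing")
records = topic.consume(p1, offset=0)
```

Because both records share a key, they land in the same partition, and a consumer reads them back in the order they were produced.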
That wraps up this week’s issue of Level Up Coding’s newsletter!
Join us again next week, when we’ll explore how to optimize CI/CD pipeline performance, binary trees explained, and techniques to improve API performance.