
The Why and How of Distributed Databases

Data is the lifeblood of your business — which is why you need a database at the center of it all. However, not all databases can meet the growing data needs of today's businesses. In particular, you need a distributed database system that enables you to innovate and transform effortlessly. In part one of this two-part series, we will explain what distributed databases are, how they work at a high level, and the key business benefits of using them. In part two, we will compare several distributed database solutions available on the market today, so you know what to look for when picking your next database. So, let's get started...

How do distributed databases work?

A distributed system is a group of interconnected computers that appears to its users as a single system. Typically, a distributed database management system (DBMS) manages several “sites”, which together appear as a single logical database stored at one site. Distributed databases provide location transparency, which means that applications do not need to know the exact site where the data is stored. When a query is run on a distributed database, a collective set of sites across different data centers works together to answer it.


Types of distributed databases

So, are all the sites in a distributed database equal? It depends on the architecture — there are two kinds: homogeneous and heterogeneous. If consistency is what you prefer, then you should go with a homogeneous architecture. In this case, system attributes such as physical resources, operating system, and DBMS are uniform across all the sites. With homogeneous architectures, deployment and management of database sites become easier.

If your specific use-case requires more customization, then you should go with a heterogeneous architecture. Heterogeneous database architectures allow different sites to have different attributes. The sites might not be aware of each other, and each site might use a different communication protocol, requiring additional translation of data between sites.


Data storage methods for distributed databases

Now, let's say your boss has asked you to find out the best way to store data in a distributed database - what are your options? You would have to choose between fragmentation and replication. But how do you know which one is the better option? Well, that depends on your particular use-case and the specific requirements of your organization.

Fragmentation or partitioning involves splitting data into smaller chunks and distributing those chunks across the different sites of a distributed database. Imagine you have to store customer preferences for a retailer in a relational database with customer ID as the primary key. How would you go about splitting and storing this data? Well, you could slice and dice the table either horizontally or vertically.


With horizontal partitioning, data is split by rows to decide which site the rows belong to - either by using a range, a hash, or a list of column values to partition on. The column used to drive data partitioning is called the “partitioning key”. In range partitioning, which is the most common horizontal partitioning method, data rows are mapped into partitions based on predefined range values of the partitioning key. For example, let's say you're an online retailer collecting information about your customers' preferences, like the color of shoes they like. In this scenario, you can horizontally partition your data by ranges of customer ID - rows corresponding to customer IDs 1 and 2 belong to horizontal partition 1 (HP1), and rows corresponding to customer IDs 3 and 4 belong to horizontal partition 2 (HP2).
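As a minimal sketch of that HP1/HP2 split, here is what declarative range partitioning looks like. The PostgreSQL 11+ syntax and all table and partition names are our own assumptions, since the article doesn't prescribe a specific engine:

```sql
-- Hypothetical customer-preferences table, range-partitioned on
-- customer_id (the partitioning key). PostgreSQL syntax assumed.
CREATE TABLE customer_prefs (
    customer_id    BIGINT,
    favorite_color TEXT,
    PRIMARY KEY (customer_id)
) PARTITION BY RANGE (customer_id);

-- HP1: customer IDs 1 and 2 (bounds are inclusive FROM, exclusive TO).
CREATE TABLE customer_prefs_hp1 PARTITION OF customer_prefs
    FOR VALUES FROM (1) TO (3);

-- HP2: customer IDs 3 and 4.
CREATE TABLE customer_prefs_hp2 PARTITION OF customer_prefs
    FOR VALUES FROM (3) TO (5);

-- Rows are routed to the right partition automatically on insert.
INSERT INTO customer_prefs VALUES (2, 'red');   -- lands in HP1
INSERT INTO customer_prefs VALUES (4, 'blue');  -- lands in HP2
```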


With horizontal partitioning, a data query targeting a particular partition, for example, a SELECT or UPDATE statement with a WHERE clause contained within the partition, can get results faster — non-relevant partitions can be skipped during query processing, reducing the response time. The independent nature of the partitions also allows for more partition-management flexibility without taking the entire dataset offline. In comparison, list partitioning is based on specifying a list of specific values for the partitioning key. You get much finer control over how rows map to partitions, and it offers a natural way to group data. Finally, hash partitioning hashes the partitioning key, and the hash value is used to identify which site a row belongs to.
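The list and hash variants look like this in the same hypothetical PostgreSQL dialect; the region values and the two-way hash split are made up for illustration:

```sql
-- List partitioning: rows map to partitions via explicit values of
-- the partitioning key (here a hypothetical region column).
CREATE TABLE prefs_by_region (
    customer_id BIGINT,
    region      TEXT
) PARTITION BY LIST (region);

CREATE TABLE prefs_eu PARTITION OF prefs_by_region
    FOR VALUES IN ('DE', 'FR', 'NL');
CREATE TABLE prefs_na PARTITION OF prefs_by_region
    FOR VALUES IN ('US', 'CA');

-- Hash partitioning: the key is hashed, and the hash value decides
-- which partition (site) a row lands in.
CREATE TABLE prefs_hashed (
    customer_id BIGINT
) PARTITION BY HASH (customer_id);

CREATE TABLE prefs_h0 PARTITION OF prefs_hashed
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE prefs_h1 PARTITION OF prefs_hashed
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);
```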

Now, data partitioning isn't a magic fix for every use-case. For example, if you're calculating the proportion of people liking each color, you will still need to touch and scan all the data partitions. This means that no partitions can be skipped, and you might not be able to reduce the query response time. In cases like this, you can consider using vertical partitioning. By storing the favorite colors of all your customers in a single, narrow table, vertical partitioning can significantly reduce the I/O and performance costs associated with accessing the data. Although vertical partitioning is very helpful, it has some issues that can't be overlooked. For example, a DELETE must be run against each vertical partition to preserve data integrity.
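Here is a minimal sketch of that vertical split, assuming a hypothetical wide customer table alongside a narrow color table:

```sql
-- Vertical partitioning: split the wide table by columns.
CREATE TABLE customer_core (
    customer_id   BIGINT PRIMARY KEY,
    customer_name TEXT,
    address       TEXT
);

CREATE TABLE customer_colors (
    customer_id    BIGINT PRIMARY KEY REFERENCES customer_core,
    favorite_color TEXT
);

-- The color-proportion query now scans only the narrow partition:
SELECT favorite_color, COUNT(*) AS likes
FROM customer_colors
GROUP BY favorite_color;

-- The caveat from the text: a delete must hit every vertical
-- partition to keep a customer's fragments consistent.
DELETE FROM customer_colors WHERE customer_id = 42;
DELETE FROM customer_core   WHERE customer_id = 42;
```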


Data management in distributed databases

Data replication involves making copies of data items so that they can reside at more than one site at any given time. Replication brings data closer to the users who rely on it to make decisions, and ensures that this data is available when it is needed. Distributed databases make this happen by designating one of the sites as the “primary site” and periodically syncing the other sites with it. Let's say your primary site needs an upgrade, or there is an unplanned downtime event affecting your primary site — replication lets you switch users to the other sites to keep your production data available. Your applications don't have to wait for you to spin up a new copy of an entire database, which means you won't lose transactions. Replication also has its own set of challenges — it requires a high degree of coordination between the different sites in a distributed database to ensure that the data values are consistent across the distributed copies. Additionally, with large volumes of data, more disk space is needed across the different sites, bumping up costs.
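As one concrete illustration of that failover path (our assumption; the article is engine-agnostic), promoting a PostgreSQL replica to become the new primary is a single administrative call:

```sql
-- Run on a replica site when the primary is down or under maintenance.
-- pg_is_in_recovery() confirms this node is currently a replica.
SELECT pg_is_in_recovery();        -- returns true on a replica

-- Promote the replica to become the new writable primary.
SELECT pg_promote(wait => true);

SELECT pg_is_in_recovery();        -- now returns false
```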

When it comes to replication speed and the consistency guarantees that replication offers, distributed databases provide two replication options - synchronous and asynchronous. In the synchronous replication model, data written by the application to the primary site is copied to all the other sites before the application is notified. In the asynchronous replication model, the application is notified as soon as data is written at the primary site, with the other sites receiving the data lazily in the background. Eventually, all the sites catch up with the updated data.
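The choice between the two models is usually a configuration knob. As a sketch, here is how PostgreSQL's streaming-replication settings express them; the standby name replica_site_1 is hypothetical, and other systems expose equivalent controls:

```sql
-- Synchronous: COMMIT returns only after the named standby has
-- applied the write, so every site sees it immediately.
ALTER SYSTEM SET synchronous_standby_names = 'replica_site_1';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';

-- Asynchronous: COMMIT returns once the primary has the write;
-- replicas catch up lazily in the background.
ALTER SYSTEM SET synchronous_standby_names = '';
ALTER SYSTEM SET synchronous_commit = 'local';

-- Reload so the settings take effect without a restart.
SELECT pg_reload_conf();
```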

Let's look at this using our previous retail example. Imagine you have two customers and only a single pair of shoes remaining in the inventory. What does the second customer see on the webpage at the exact moment the first customer receives a purchase confirmation? With synchronous replication, the second customer would see the item as out of stock. Synchronous replication ensures data integrity and reduces the complexity of knowing where the most recent copy of data is located, at the expense of slower response times. With asynchronous replication, the second customer would see that the item is still available. Asynchronous replication operations take less time to complete, making your application more responsive, but you get some degree of temporary inconsistency, like items appearing in stock when they are not.
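Here is that race sketched in SQL; the inventory table and values are hypothetical:

```sql
-- One pair of shoes left.
CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INT);
INSERT INTO inventory VALUES ('blue_sneakers', 1);

-- Customer 1 buys it on the primary site:
BEGIN;
UPDATE inventory SET stock = stock - 1
 WHERE item = 'blue_sneakers' AND stock > 0;
COMMIT;

-- Customer 2's page load reads from a replica at the same instant:
SELECT stock FROM inventory WHERE item = 'blue_sneakers';
-- Synchronous replication: returns 0 (shown as out of stock).
-- Asynchronous replication: may still return 1 until the replica
-- catches up, so the item briefly appears available.
```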

Another way of having your data in more than one place is by using specialized software to make copies of data and store them offsite in case the original is lost or damaged. This is typically called duplication (or “backups”), and it is a good option for archiving old data that won't be needed too often.

Your data, your rules

Distributed databases offer location transparency of data along with local autonomy. This means that even though applications might not know exactly where the data resides, each site has the capability to control local data, administer security, keep track of transactions, and recover from local site failures. This autonomy is available even if the connections to other sites have failed. It also allows for more flexibility when data stored at particular sites needs stronger security and compliance controls than data stored elsewhere. For example, customer information stored for retail customers in the EU region needs to meet GDPR requirements.

Benefits of using a distributed database architecture

With data becoming an essential aspect of our lives, distributed databases lie at the heart of every organization's data infrastructure. In most cases, end-users interacting with a web service or a mobile application might not see a distributed database in action — it is the distributed database working hard in the background that is powering many of these use-cases. Here are just a few examples of the critical benefits that distributed databases bring to the table.

Improved performance

End users are complaining. Your boss is upset, and it's time to fix the slow application that everyone depends on. Where should you start looking? In many cases, the performance slowdown happens due to a bottleneck with your centralized database. With distributed databases, you can distribute data across geographies and bring it closer to your users — efficient data access and transfer results in faster application response times. Distributed databases also help you to better leverage parallel processing across commodity servers, eliminating the need for custom or expensive hardware.

Enabling massive scalability

Scalability means getting more out of your system as you add resources to it. Who doesn't want a system that can scale in tune with business requirements, whenever they need it? Distributed databases are modular by design and can be easily extended on demand. In contrast with centralized databases, which can scale only vertically by adding more resources (CPU, memory, and disk), distributed databases can scale both vertically and horizontally (by adding more servers). This provides an additional degree of flexibility in scaling your infrastructure. For example, due to the pandemic, many consumers turned to online retail options. If you're an online retailer, you have to quickly scale your data infrastructure to cope with the influx of new online shoppers. Now imagine how easy that would be if you were running a distributed database.

Delivering round-the-clock reliability

Staying online 24x7 is critical to today's digital businesses. If a database is unavailable, the data consumers — that is, apps, customers, and business users — can't access the critical data that keeps the business operational. By automatically replicating data across multiple sites, distributed databases ensure data redundancy. If a failure occurs, this setup allows for easy failover to a replica site, so that there are no hiccups in data access. Downtime is expensive for businesses, and it's important to fail fast, recover, and mitigate the severity of the failure. For many business applications, distributed databases provide the saving grace that ensures business continuity.

These are only some of the reasons why you should pick a distributed database. With so many different options to pick from, it’s important to know what characteristics to look for and how these compare across the different databases in the market. Stay tuned for part 2 of this series, where we will explain what these characteristics are and why they matter.

So, what are the challenges in traditional distributed DBMS?

Over the last few decades, distributed databases have come a long way. However, they still have a few key challenges that are worth mentioning:

Performance limitations at internet scale

Allowing writes across a geo-distributed database that is accessed by millions of users is challenging. However, this is a common use-case among today's modern apps, including IoT, e-commerce, and social networks. Many traditional distributed databases have addressed this by having a single primary region responsible for orchestrating writes, while the local data closer to users is available only for reads and not for updates. This design can severely affect the performance of a system.

Scaling is complex

Data partitioning is not a silver bullet, and selecting a partitioning key is an art in and of itself. If you pick the wrong partitioning key, you can disturb the load balancing of data, making some partitions hotter than others. This reduces the effectiveness of the partitioning and overcomplicates your database management and maintenance.
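For instance, with the hypothetical schema below, list-partitioning orders by region concentrates load on whichever region dominates traffic, while hashing a high-cardinality key such as order_id spreads rows evenly:

```sql
-- Skew-prone choice: one partition gets "hot" if most orders
-- come from a single region.
CREATE TABLE orders_by_region (
    order_id BIGINT,
    region   TEXT
) PARTITION BY LIST (region);

-- Better-balanced choice: hash a high-cardinality key.
CREATE TABLE orders_hashed (
    order_id BIGINT,
    region   TEXT
) PARTITION BY HASH (order_id);

CREATE TABLE orders_h0 PARTITION OF orders_hashed
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE orders_h1 PARTITION OF orders_hashed
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE orders_h2 PARTITION OF orders_hashed
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE orders_h3 PARTITION OF orders_hashed
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```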

Database model != programming model

Traditional distributed database systems have only one data model, and in most cases, this single data model is not a good fit for today's modern applications. The incompatibility between the application data types and what the database model offers is called an 'impedance mismatch'. This calls for additional programming language bindings and a database change whenever the app changes.

Decentralized data governance and security

A significant challenge in designing and managing a distributed database is the inherent lack of centralized knowledge of the entire database. This also applies to things like data governance and security in a distributed database. Lack of a consistent approach to both of these aspects introduces risks, and any data breach can quickly tarnish the image of an enterprise and be expensive.

High TCO, needing a dedicated operational team

Distributed databases are complex, needing a fully dedicated operational team to manage your data infrastructure. The costs associated with running a distributed database, such as hardware procurement, maintenance, and hiring across different geographies, add up quickly, making it costlier than a typical DBMS.

So, how does Fauna fare compared to the other distributed DBMS solutions in the market?

Fauna is a flexible, developer-friendly, transactional cloud database delivered to you as a secure data API built for modern web applications embracing the cloud. Fauna rethinks traditional DBMS as a "data API", providing you with all the capabilities of an old-guard database without sacrificing flexibility, scale, and performance. With Fauna, developers can quickly wrap business logic into programmable functions and run them near the data in a serverless fashion. This accelerates application performance and offers more developer agility.

In the table below, we’ll look at several key DBMS attributes across different vendors and explain why they matter for your application:

| Attribute | Fauna | AWS Aurora | DynamoDB | MongoDB | Why does it matter to your application? |
| --- | --- | --- | --- | --- | --- |
| Data model | Relational and document-oriented | Relational | Non-relational (schema-less data items in a table) | Non-relational, document-based data model | If the data model fits your use-case well, there are several benefits for your application. |
| Query model | API-based query model using the Fauna Query Language (FQL) and GraphQL | SQL-based query model similar to popular technologies such as MySQL and Postgres; also compatible with the Amazon Redshift query model | PartiQL, a SQL-compatible query language, or DynamoDB’s classic CRUD APIs | The MongoDB query language (MQL), a query language based on JavaScript | A query model defines how apps interact with the database. To simplify app development, you need a modern and simple query model; APIs fit this bill. |
| Transactional model | Uses the mechanics of Calvin to support strictly serializable, externally consistent transactions in a distributed environment | ACID-based transactional model | Serializable multi-item read and write transactions; only ACID-compliant within the region they occur in | Multi-document ACID transactions | Depending on the use-case, a certain transaction model might be more desirable than another. |
| Consistency model | All Fauna queries are consistent across all deployment models and offer strictly serializable isolation levels. | The shared codebase restricts Aurora’s consistency model to primary/secondary replication; it is impossible to perform write transactions or consistent read-only queries in non-primary processes and regions. | Strong consistency is available to some degree, but only within the context of a single region. | To ensure consistency, developers must prevent queries from including data that could be rolled back, and ensure that no writes occur during the read. | A database with a strong consistency model is preferred, to ensure that use-case functionality is not negatively affected. |
| Indexing | Provides field-based secondary indexing. | Uses the same indexes as MySQL/InnoDB; Aurora also supports spatial indexes. | Supports primary and secondary indexes (local and global). | Provides primary and secondary indexing, plus specialized indexes such as hashed indexes, wildcard indexes, and geo indexes. | With multiple indexing options, there is more room to tune your database performance as your data grows. The result is a better experience for your application users. |
| Fault tolerance | Fauna’s underlying architecture makes it highly available and fault tolerant, routing requests to the available processing nodes closest to the source. The transaction protocol replicates data in at least three locations, ensuring that Fauna never loses your data. | The cluster volume spans multiple Availability Zones in a single AWS Region, and each Availability Zone contains a copy of the cluster volume data, so a DB cluster can tolerate the failure of an Availability Zone without any loss of data and with only a brief interruption of service. | Relies on AWS Availability Zones (AZs), replication, and long-term storage to protect against data loss or service failure. The leader nodes are potential bottlenecks, as only they can accept writes and strongly consistent reads. | Node failures are handled by the replica set: if the primary fails, a new primary is elected by the remaining nodes using a prioritized version of Raft, and the system continues to operate normally. | A database system with a high degree of fault tolerance is preferred by users building mission-critical applications. |
| Security | Offers a web-native security model: authN with keys and tokens; attribute-based access control (ABAC) and pre-defined roles for authZ; TLS/SSL for on-the-wire encryption between the database and clients; client/tenant separation via database hierarchies and priority workloads; and secure access tokens for direct client access to the database. Fauna clusters require authentication and cannot be left accidentally unprotected. | AWS is responsible for protecting the infrastructure that runs AWS services in the AWS Cloud. You are responsible for other factors, including the sensitivity of your data, your organization’s requirements, and applicable laws and regulations. | Like many AWS products, DynamoDB inherits the AWS Identity and Access Management (IAM) feature, with which developers can specify coarse and granular user permissions applicable to the entire DynamoDB API. | Supports SCRAM, x.509, LDAP, and Kerberos authentication; role-based access control with user-defined roles; permissions on collection subsets via non-materialized views; TLS/SSL for the database and clients; encryption at rest; auditing controls; and tenant separation via databases. | If application data is breached, the organization faces huge risks and penalties. A lack of sufficient security features also means the database might not be suitable for regulated-industry use-cases. |
| Scalability | Delivers unlimited scale with zero input from customers, by maintaining several consistent, full replicas of customer data and scaling up its infrastructure behind the scenes. | Aurora storage automatically scales with the data in your cluster volume. | Aims to absorb the responsibility of scaling to customer needs. Unlike Fauna, it still leaves significant operational work and overhead for customers, making it less favorable. | Provides sharded clusters as a way to horizontally scale the largest of workloads across many replica sets. Creating a sharded cluster requires a small amount of downtime to deploy config servers and to point application drivers to “mongos” processes instead of the data nodes. | If the application faces an influx of new users, easy scalability is a must. |
| Operations | Requires no operational work from users to manage the scalability and availability of the system; it all happens automatically. Fauna does offer a CLI to help users script schema-administration tasks. | Amazon Aurora releases updates regularly; updates are applied to Aurora DB clusters during system maintenance windows. | Requires no operational work from users to manage the scalability and availability of the system. Developers can implement and tweak DynamoDB deployments through the AWS CLI, AWS Management Console, AWS SDK, NoSQL Workbench, or directly through the DynamoDB low-level API. | Relies on Ops Manager, Cloud Manager, or the software behind MongoDB Atlas to apply complex changes to the database. | With reduced operational overhead to run a database, developers can focus more time on building their applications. |

Get started on Fauna, instantly and for free

Sign up for free

The data API for modern applications is here. Sign up for free, without a credit card, and get started instantly.

Quick start guide

Try our quick start guide to get up and running with your first Fauna database in only 5 minutes!
