Data has arguably become one of the most important assets for any business. Data captures the past and present state of an organization and can help shape and forecast its future.
Thanks to the advancements in analytics and artificial intelligence, and the doors that these technologies open for innovation, data has taken center stage in many organizations’ strategies.
In this series of blog posts I am going to describe a few use cases that you can implement with minimal impact on your existing databases and with no need to migrate them to the cloud.
This first post examines the main cause that limits data-driven innovation and describes an approach to overcoming it. The posts that follow will dive deeper into the technical implementation of solutions for different use cases.
Legacy database architectures
Even today, many businesses still struggle to use their existing data for use cases other than the ones their databases were designed for. There are several reasons for this, but the main one is the legacy database architecture that was used when setting those systems up.
A typical architecture consists of a transactional database, also known as OLTP (On-Line Transactional Processing)
database, and an analytical one, or OLAP (On-Line Analytical Processing) database. Data is copied in batches from the OLTP to the OLAP database using Extract-Transform-Load (ETL) processes. These processes run on a schedule (e.g., once a day) and the tools are often provided by the same vendors as the database engines. To make this work, these databases are fine-tuned to meet the required levels of performance, availability, and so on. This entails the work of developers, database administrators (DBAs), and business analysts to, among other tasks, design the right tables, indexes, and partitions; optimize transactional queries; and schedule maintenance and ETL jobs to minimize the impact on business operations.
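To make this more concrete, here is a minimal sketch of what such a scheduled batch ETL job could look like. It assumes PostgreSQL on both sides purely for illustration; the connection strings, table names, and the updated_at watermark column are all hypothetical.

```python
# A minimal sketch of a nightly batch ETL job, assuming PostgreSQL on both
# sides purely for illustration; connection strings, table names, and the
# updated_at watermark column are hypothetical.
import datetime

import psycopg2

EXTRACT_SQL = """
    SELECT order_id, customer_id, amount, updated_at
    FROM orders
    WHERE updated_at >= %s AND updated_at < %s
"""

LOAD_SQL = (
    "INSERT INTO fact_orders (order_id, customer_id, amount, updated_at) "
    "VALUES (%s, %s, %s, %s)"
)

def run_nightly_etl(oltp_dsn: str, olap_dsn: str) -> None:
    today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    with psycopg2.connect(oltp_dsn) as src, psycopg2.connect(olap_dsn) as dst:
        # Extract: pull yesterday's rows from the transactional database.
        with src.cursor() as cur:
            cur.execute(EXTRACT_SQL, (yesterday, today))
            rows = cur.fetchall()
        # Transform/Load: the transform is trivial here; rows are simply
        # appended to the analytical fact table.
        with dst.cursor() as cur:
            cur.executemany(LOAD_SQL, rows)
```

Even a job this simple has to be scheduled, monitored, and timed so that it doesn't interfere with business operations, which is exactly the kind of ongoing work described above.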
To those challenges, you also need to add the inherent difficulty of scaling relational databases (TL;DR: it’s hard) and the constraints of on-premises infrastructure, where you can’t simply add more storage or spin up a more powerful server in a matter of seconds. The “you’re gonna need a bigger boat” scenario when operating on-premises usually means waiting for months until your database is running on an appropriately sized server.
As you can rightfully guess, accommodating new use cases in that ecosystem requires plenty of planning. Furthermore, if those use cases go in the direction of near real-time analytics or machine learning, you may find that your legacy databases simply don’t fit the bill (and rightfully so, as they were not designed for it).
Migrate to the cloud, you say?
If you have stuck with me so far, you may be thinking: “right, so I need to migrate those databases to the cloud, but...”. Well, that is an option with its own benefits, but I am aware that not everyone is ready or willing to migrate their databases to AWS.
As I explained in the introduction, the goal of this blog series is to show you how you can start leveraging the flexibility and scale of AWS to build your innovative use cases while keeping your databases where they are and minimizing the operational impact on them. One of the solutions for doing this is called Change Data Capture or CDC.
Change Data Capture
To define Change Data Capture (CDC), I’ll simply quote its Wikipedia article:
In databases, change data capture is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data.
In a nutshell, it allows you to know that "something" has happened shortly after it has happened.
Although it depends on the source database engine, CDC typically uses the database transaction logs, so there is no need to query the database to extract the data. This has the benefit of a much lower impact on the database server than continuously reading from the tables you are interested in.
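To make the idea tangible, here is a hedged illustration, not tied to any specific CDC tool, of the kind of change record such a pipeline might emit after decoding the transaction log. The exact shape and field names vary by product and are hypothetical here.

```python
# A hypothetical change record, as a CDC pipeline might emit it after decoding
# the transaction log; the layout and field names vary by tool.
change_event = {
    "operation": "update",                  # insert | update | delete
    "schema": "dbo",
    "table": "orders",
    "commit_timestamp": "2021-06-01T12:34:56Z",
    "before": {"order_id": 42, "status": "pending"},  # row image before the change
    "after":  {"order_id": 42, "status": "shipped"},  # row image after the change
}

# Downstream consumers react to the event instead of polling the source table.
if change_event["operation"] == "update" and change_event["after"]["status"] == "shipped":
    print(f"Order {change_event['after']['order_id']} shipped; notify the customer")
```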
Implementing such design patterns is not an easy task, but don't fret, as there are offerings that provide out-of-the-box support for CDC.
There are many options, including CDC solutions from partners in the AWS Partner Network and AWS’s own AWS Database Migration Service (DMS).
AWS Database Migration Service
In this series, I will use AWS Database Migration Service (DMS).
AWS DMS is a cloud service that makes it easy to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud or between combinations of cloud and on-premises setups.
But more importantly (I promised you don’t need to move your database from where it is), AWS DMS supports CDC and can run continuous replication tasks, so data keeps flowing from a source database to different target databases and AWS services.
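As a sketch of what setting up ongoing replication could look like with the AWS SDK for Python (boto3), assuming the source endpoint, target endpoint, and replication instance already exist (their ARNs below are placeholders):

```python
# A minimal sketch of creating and starting a CDC-only (ongoing replication)
# task with boto3. The endpoint and replication instance ARNs are placeholders
# for resources you would create beforehand (console, CLI, or IaC).
import json

import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all-dbo-tables",
            "object-locator": {"schema-name": "dbo", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

task = dms.create_replication_task(
    ReplicationTaskIdentifier="orders-cdc-task",
    SourceEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:INSTANCE",
    MigrationType="cdc",  # ongoing replication only, no initial full load
    TableMappings=json.dumps(table_mappings),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```

The next posts in the series will go through this setup properly; the snippet is only meant to show that an ongoing replication task is a regular AWS resource you can create and start programmatically.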
AWS DMS supports many of the most popular database engines as sources for data replication, so it’s very likely that you can use it with your existing databases.
The option to use different and heterogeneous targets is key to implementing different use cases flexibly. For example, you may want to use an Amazon Kinesis data stream for near real-time data processing, or Amazon S3 to hydrate a data lake that stores data for analytics or for training machine learning models.
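For instance, if a Kinesis data stream were the DMS target, a consumer could be sketched with boto3 as below. The stream name is hypothetical, and the record layout hinted at in the comments is only indicative; in practice you would more likely attach a Lambda function or a Kinesis consumer library.

```python
# A hedged sketch of reading change records from a Kinesis data stream used as
# a DMS target; the stream name is hypothetical and the record layout shown in
# the comments is only indicative.
import json

import boto3

kinesis = boto3.client("kinesis")

stream = "cdc-orders-stream"  # hypothetical stream configured as the DMS target
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    event = json.loads(record["Data"])
    # Change records typically carry the row payload plus metadata such as the
    # operation and table name; adjust to whatever your task actually emits.
    print(event.get("metadata", {}).get("operation"), event.get("data"))
```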
Conclusion
In this post I have outlined the issues with traditional database architectures when you need to implement new use cases and have described CDC as a way to overcome those limitations. Finally, I have provided a glimpse of some of the AWS DMS capabilities that are relevant for using CDC.
In the next blog post I will provide the technical implementation of a solution to enable CDC using AWS DMS and a Microsoft SQL Server source database.