SeattleDataGuy

Posted on Jun 26, 2020

What Is Data Virtualization And Why Use It?

#sql #database #datascience

The need for data analytics and data science keeps growing, along with the demand for more granular data and self-service analytics. Yet getting access to the data you need is still far from easy.

If you have worked in any role related to data, or even just adjacent to it, you are probably used to, and frustrated by, the process of getting access to that data.

Oftentimes your team will need to work with a BI or data engineering team to get the data pulled from all the various third-party sources, internal applications, and databases. This could take weeks, maybe even months, all while your business partners constantly wonder where their numbers are, pressuring you and not understanding what is taking so long.

All so your team can do a quick analysis.

What if you had other options?

What if you could just access the data of multiple systems in one place?

Well, there are options.

One of them is a concept called data virtualization.

Data Virtualization

Data virtualization provides a virtual layer on top of all of your data storage systems so you can query them together even when the data lives in different systems. With data virtualization, you can connect data across an organization in one central place without duplicating it into a data warehouse. This can save a lot of cost as well as reduce technical work.

Several vendors are working on giving companies access to their data without having to develop complex data warehouses and ETLs first.

In this article, we will discuss the benefits of data virtualization and look at Denodo and Promethium, which both provide data virtualization as a service.

What Can Benefit From It?

Faster Analytics

Directors, CTOs, and decision-makers in general are no longer OK with waiting months to get a new report. As a data analyst, you used to be able to point to other departments that were slowing you down. You needed to put in data requests that would get lost in the sea of other IT requests.

But with so many self-service analytics tools available, not being able to access data and develop reports quickly can be a major disadvantage. Your competition might already be getting insights on the newest happenings in the world while you are trapped behind an archaic system.

Data virtualization looks to increase the speed at which analysts can access data by simplifying the entire process, which in turn speeds up your analytics. The goal is that when a decision-maker asks a question, they can have an answer in a few hours or by the next day, not three months from now.

Reduces Workload On Data Engineers

Data engineers and BI teams are often the bottleneck between data analysts and their data. It's no fault of their own. There are so many different initiatives and projects going on that it can be difficult to manage every ad-hoc data request that comes down the pipeline.

Data virtualization lets analysts handle many of those requests themselves, which allows your data engineers to focus on larger, more impactful work rather than on smaller data requests.

Simplifying Data Workflows And Infrastructure

Getting data from all of a company's various database systems and third parties is very complicated. One third-party API is SOAP-based with XML, another only exports CSV reports, and yet another is only updated every 24 hours.

This of course doesn't even account for all the various database systems and cloud storage systems. The world of data is becoming more and more complex. This makes it hard to get all of your data into one place.

You need lots of ETLs, data warehouses, and workflows to manage all of the various data sets. Even then, sometimes all that data just gets siloed off for each team.

This makes for a very complex and difficult world for the data analyst to work in.

Data virtualization circumvents that by connecting data sources virtually and not requiring a separate ETL for every process and data source.

Overall, this simplifies your company's data infrastructure and reduces the number of workflows required.
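To make the idea of connecting sources virtually a bit more concrete, here is a minimal Python sketch. It is only an illustration, not how any particular product works. The sample CSV and XML payloads, along with the helper names `csv_rows`, `xml_rows`, and `join_orders_to_customers`, are all assumptions made up for this example. Each source is wrapped in a small adapter that yields plain rows on demand, and the join happens at read time without copying anything into a warehouse.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample payloads standing in for two third-party sources.
CSV_REPORT = "order_id,customer_id,total\n1,10,25.50\n2,11,40.00\n"
XML_RESPONSE = """
<customers>
  <customer><id>10</id><name>Acme</name></customer>
  <customer><id>11</id><name>Globex</name></customer>
</customers>
"""


def csv_rows(text):
    """Yield each CSV record as a dict, read lazily at query time."""
    yield from csv.DictReader(io.StringIO(text))


def xml_rows(text, tag):
    """Yield each <tag> element as a dict of child tag -> text."""
    root = ET.fromstring(text.strip())
    for element in root.iter(tag):
        yield {child.tag: child.text for child in element}


def join_orders_to_customers():
    """Combine both sources on the fly without writing a copy anywhere."""
    names = {row["id"]: row["name"] for row in xml_rows(XML_RESPONSE, "customer")}
    for order in csv_rows(CSV_REPORT):
        yield {
            "order_id": int(order["order_id"]),
            "customer": names.get(order["customer_id"]),
            "total": float(order["total"]),
        }


if __name__ == "__main__":
    for row in join_orders_to_customers():
        print(row)
```

A real virtualization layer would also deal with query pushdown, caching, and security, none of which this sketch attempts.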

Data virtualization is infrastructure-agnostic.

This means you can integrate data from whatever databases your company currently runs, which can result in lower operational costs. You could be using Oracle, MySQL, Postgres, AWS RDS, or any number of other database backends, but data virtualization's goal is to integrate all of them into one final system.

Some of this is dependent on the data virtualization provider you choose. But overall, many of them are quite capable of integrating with most databases.

Denodo and Promethium

Speaking of data virtualization providers, let's talk about two products that are currently on the market: Denodo and Promethium.

Denodo

One of the better-known providers of data virtualization is Denodo. Overall the product is arguably the most mature and feature-rich.

Denodo's focus on helping users gain access to their data in essentially a single service is what makes it so popular with its many customers. Everyone from healthcare providers to the finance industry relies on Denodo to relieve pressure on BI developers and data scientists by reducing the need to create so many data warehouses.

Denodo also works across cloud providers, meaning you can use it across AWS, GCP, Azure, and others. This is another huge advantage when it comes to analyzing data, because many companies have their data spread across several cloud providers.

This makes it difficult to analyze data and that's where data virtualization steps in.

Denodo Technical Features

From a user experience perspective, Denodo feels very similar to many database products. It kind of reminds me of SQL Server in the sense that it has a sidebar with a bunch of folders, and the main display shows some database design or configuration.

You can see this in the image below.

Denodo allows you to connect to various data sources from Denodo itself. This includes Oracle, SQL Server, HBase, etc.

It also uses VQL, which stands for Virtual Query Language. This is similar to SQL and gives developers the ability to create views on top of all the databases that Denodo is connected to. This means you can join tables across different databases and create new data sets.

This is one of the many benefits Denodo offers. It allows users to create virtual views based on business logic across departments. This can be very valuable considering how locked up most data is behind data silos and application databases.
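To show what a virtual view across databases looks like in practice, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in. This is not VQL and not Denodo. The attached `sales` and `crm` databases, their tables, and the `revenue_by_region` view are all hypothetical names made up for the example. The point is just that a view can join tables living in separate databases without copying the rows anywhere.

```python
import sqlite3

# Two separate in-memory databases stand in for two departmental systems.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "West"), (2, "East")],
)

# In SQLite, a TEMP view may reference tables in any attached database.
# The view stores only the query (the business logic), not the data.
conn.execute("""
    CREATE TEMP VIEW revenue_by_region AS
    SELECT c.region AS region, SUM(o.total) AS revenue
    FROM sales.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.region
""")

result = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(result)
```

An analyst querying `revenue_by_region` does not need to know that orders and customers live in different systems, which is the same idea a data virtualization tool applies across much more varied backends.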

Denodo will probably remain one of the more popular data virtualization technologies for a while. However, many are looking to supersede it.

Promethium

Promethium is one of the many data tools approaching data virtualization from a slightly different angle than Denodo. Promethium's goal is not only to provide data virtualization services but also to make the data analytics process easier in general.

Promethium's product, which they call the Data Navigation System, acts as an augmented data management solution. The goal of this system is to give analysts the ability to validate, discover, and assemble data easily.

The result is supposed to be an all-in-one tool that takes work that used to require several people and a few months and reduces it to one person and a few minutes.

This is somewhat similar to the data virtualization features that Denodo offers.

However, the goal of Promethium is to take several steps further after data virtualization.

Promethium Technical Features

With your data virtualized, Promethium then looks to use Natural Language Processing (NLP) to help you build out the queries, or as they call them, "questions," that you will be asking of your data sets. It even stores common questions that are asked. For example, let's say you want to know what the average cost of products is by county. Promethium would attempt to develop a query that matches this request.
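To give a feel for what turning a question into a query means, here is a deliberately tiny Python sketch. It is not real NLP and it is certainly not how Promethium does it. It is just a single regular expression that handles questions shaped like the example above. The function `question_to_sql`, the pattern, and the assumption that the words in the question match actual table and column names are all made up for illustration.

```python
import re

# Maps words in a question to SQL aggregate functions (illustrative only).
AGGREGATES = {"average": "AVG", "total": "SUM", "maximum": "MAX", "minimum": "MIN"}

# Matches "<aggregate> <column> of <table> [is] by <group>".
PATTERN = re.compile(
    r"(?P<agg>\w+) (?P<column>\w+) of (?P<table>\w+) (?:is )?by (?P<group>\w+)",
    re.IGNORECASE,
)


def question_to_sql(question):
    """Translate a question of the supported shape into SQL, or return None."""
    match = PATTERN.search(question)
    if match is None or match["agg"].lower() not in AGGREGATES:
        return None
    agg = AGGREGATES[match["agg"].lower()]
    column, table, group = match["column"], match["table"], match["group"]
    return f"SELECT {group}, {agg}({column}) FROM {table} GROUP BY {group}"


if __name__ == "__main__":
    print(question_to_sql("What is the average cost of products by county?"))
```

Anything outside that one sentence shape returns `None`, which is exactly the gap a proper NLP system is meant to close.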

It also has another feature it refers to as RPA, or Robotic Process Automation. The goal of this feature is to automatically associate data sets that belong together.

All of these features are geared towards making a truly self-service analytics system. Data virtualization alone still has a lot of gaps, like connecting disparate data sets, so a system that can detect relationships would be quite beneficial.
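One simple way to think about detecting relationships between data sets is to compare how much the values in two columns overlap. The Python sketch below does that with a Jaccard score. It is only a guess at the general idea and not a description of Promethium's feature. The sample tables, the `suggest_joins` helper, and the 0.5 threshold are assumptions picked for the example.

```python
from itertools import product


def jaccard(left, right):
    """Share of distinct values two columns have in common (0.0 to 1.0)."""
    left, right = set(left), set(right)
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def suggest_joins(table_a, table_b, threshold=0.5):
    """Return (column_a, column_b, score) for column pairs that overlap enough."""
    suggestions = []
    pairs = product(table_a.items(), table_b.items())
    for (name_a, values_a), (name_b, values_b) in pairs:
        score = jaccard(values_a, values_b)
        if score >= threshold:
            suggestions.append((name_a, name_b, score))
    # Highest overlap first, since that is the most likely join key.
    return sorted(suggestions, key=lambda item: item[2], reverse=True)


# Hypothetical column-oriented tables from two unrelated systems.
orders = {"cust": [10, 11, 12, 10], "amount": [25.5, 40.0, 12.5, 8.0]}
customers = {"id": [10, 11, 12, 13], "zip": [98101, 98052, 98004, 98109]}

if __name__ == "__main__":
    print(suggest_joins(orders, customers))
```

Here `cust` and `id` share three of four distinct values, so they come back as a likely join key even though the column names do not match.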

Overall, there are a lot of options for third-party tools looking to solve the self-service analytics problem.

Conclusion

Data virtualization offers an opportunity for data analysts and data scientists to analyze data without creating data pipelines. This allows your teams to analyze and meld data quickly. This in turn gives them the ability to answer their directors' and managers' questions without being stalled by data engineers, who are often busy with their own work. Denodo and Promethium are both looking to fill the gap in data virtualization.

However you end up deciding to approach your ad-hoc data analysis, we wish you good luck!
