DEV Community



Data Lakes

Our digital universe doubles in size annually - and is predicted to exceed 44 trillion gigabytes by the end of 2021. Up to 90% of this data is semi-structured or unstructured, which makes storing, maintaining, and processing such a vast volume a challenge. This is where a data lake comes in.

What is a Data Lake?

A data lake is a central storage repository that holds a vast amount of data in its raw format from many different sources. It can store data that is unstructured, semi-structured, or structured. Data can be kept in a flexible format for future use because the data lake identifies each item via metadata tags, enabling fast retrieval when needed.

There is no fixed limit on storage capacity, as clusters of data can exist on-premises or in the cloud.
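
To make the metadata-tag idea above concrete, here is a minimal sketch in Python. It is an in-memory toy, not how any particular data lake product works: the `TinyLake` class, its `put` and `find` methods, and the tag names (`source`, `kind`) are all hypothetical and chosen only for illustration. The point is that the payload is stored untouched and only the tags are used to locate it later.

```python
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw item in the lake: the bytes are stored untouched."""
    key: str
    payload: bytes
    tags: dict = field(default_factory=dict)


class TinyLake:
    """Hypothetical in-memory stand-in for a data lake with a tag lookup."""

    def __init__(self):
        self._objects = {}

    def put(self, key, payload, **tags):
        # Store the payload as-is; only the tags describe it.
        self._objects[key] = LakeObject(key, payload, dict(tags))

    def find(self, **wanted):
        # Return the keys of objects whose tags match every requested pair.
        return sorted(
            key
            for key, obj in self._objects.items()
            if all(obj.tags.get(name) == value for name, value in wanted.items())
        )


lake = TinyLake()
lake.put("logs/2021-01-01.log", b"GET /home 200", source="web", kind="server_log")
lake.put("events/clicks.json", b'{"user": 1, "page": "/home"}', source="app", kind="clickstream")
lake.put("sensors/t1.csv", b"ts,temp\n1,20.5", source="sensor", kind="reading")

print(lake.find(source="app"))  # ['events/clicks.json']
```

A real lake would keep the payloads in object storage and the tags in a separate catalog, but the division of labour is the same: raw data in, descriptive metadata alongside it.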

Why use Data Lakes

Data lakes have become an essential part of many big data initiatives because they are more flexible and easier to scale when working with significant volumes of data, especially data generated at high velocity, such as app activity data. Web, sensor, and app activity data are increasingly prevalent, so interest in data lakes is also growing quickly.

To determine if your company needs a data lake, let’s explore the following indicators:

How Structured is your Data?

If you’re processing a large volume of semi-structured or unstructured data, doing so without a data lake can be extremely time-consuming. Storing large volumes of unstructured data in a conventional database requires extensive data preparation before it can be loaded; this is especially true for event-based data such as clickstream or server logs.
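
As a rough illustration of why event data suits a lake, the sketch below keeps clickstream events exactly as they arrived and only picks out fields at read time. The sample events, the `RAW_EVENTS` constant, and the `read_events` helper are all made up for this example; the assumption is that events are stored as one JSON object per line and that different events carry different fields.

```python
import json

# Hypothetical raw clickstream events, stored exactly as they arrived.
RAW_EVENTS = """\
{"user": "u1", "event": "click", "page": "/home"}
{"user": "u2", "event": "scroll", "depth": 0.8}
{"user": "u1", "event": "click", "page": "/pricing", "referrer": "/home"}
"""


def read_events(raw, fields):
    """Parse JSON lines and keep only the requested fields (missing ones become None)."""
    rows = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        rows.append({name: record.get(name) for name in fields})
    return rows


# Decide which fields matter only when the question is asked.
clicks = [
    row
    for row in read_events(RAW_EVENTS, ["user", "event", "page"])
    if row["event"] == "click"
]
print(clicks)
```

No table schema had to be agreed before the events were stored, so a later question that needs `referrer` or `depth` can be answered from the same raw data.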

Is Data Retention a Problem?

Storing large volumes of data in a database can be expensive. This often leads to compromises on data retention - either limiting how long historical data is held or trimming certain fields to control costs. If your business struggles to strike the right balance between retaining data for analytical purposes and deleting data to control costs, then a data lake may be the solution.

Is your Use Case Experimental or Predictable?

What you intend to do with the data is really the determining factor in whether a data lake is the best solution. Suppose you want to build reports or dashboards that will essentially be created through running a predetermined set of queries against tables that are updated regularly. In that case, you may be better off looking at a data warehouse.

However, for experimental use cases such as predictive analytics and machine learning, it can be challenging to know in advance what data you will need, which is where a data lake could be more beneficial.

Benefits and Risks of using Data Lakes

As with all business solutions, there are benefits and risks when using data lakes.

Benefits in Using a Data Lake:

  • Aids with advanced analytics and productionizing
  • Cost-effective scalability and flexibility
  • Great value from unlimited data types
  • Significantly reduced long-term cost of ownership
  • Permits economical file storage
  • Quickly adaptable
  • Centralisation of different content sources
  • Flexible access to the data worldwide - due to the cloud system

Risks of Using a Data Lake:

  • May lose momentum and relevance over time
  • Significant risk is involved in designing the data lake
  • Increases storage and compute costs
  • Without a record of the lineage of findings by previous analysts, it is hard to build on insights from others who have worked with the data
  • Access control and security can be at risk
  • Unstructured data could lead to unusable data, disparate and complex tools, and difficulties with enterprise-wide collaboration.
