Geo-Distributed Data Lakes Explained (By a Non-Developer)

#database #distributedsystems #webdev

Happy 2021! This week’s term is a mouthful, but don’t let the long name deter you. It’s a pretty interesting topic, and I think you will agree after finishing this breakdown. There is a lot to say about how awesome it is to combine the flexibility of a data lake with the power of a distributed architecture, but I’ll get more into the benefits of both as a joint solution later. To start, I want to look at geo-distributed data lakes in two parts before we marry them together. For my non-developer brain, that made the most sense! No time to waste, let’s kick things off with the one and only… data lakes.

It’s a Data LAKE, Not Warehouse!

It shouldn’t be a shock to the system to point out that we are living in a data-driven world going into 2021. Because of this, “data lakes” are a fitting term for the amount of data companies are collecting. In my opinion, we could probably start calling them data oceans, expansive and seemingly never-ending. So what is a data lake exactly? Think of all your data as the water and your repository as the lake that holds that water. Unstructured data or “water” comes from one source and your structured data/water comes from another. You can use any of the water coming from any number of water sources (i.e. multiple structured and unstructured data sources) to build out visualizations, real-time analytics, or even machine learning models. So while your water is flowing in from rivers, creeks, and mountain runoffs, you can drink any of that water and it will keep you hydrated.

Data lakes can be on-premise or hosted in the cloud, and I think my favorite thing about data lakes is that the data is stored in its natural or raw form, usually as files or what are called “object blobs.” What is the first thing you think of when you hear the word blob? It’s such a great word. Because a lake holds raw blobs, it can store both relational and non-relational data, which makes it more cost-effective than other solutions when it comes to storing historical data. Data lakes democratize data, which simply means that far more people can get to the data, instead of relying on a few gatekeepers or admins to hand it out. Data lakes also give folks an easy way to understand the data that is shared in the repository. I can imagine that at large companies with many employees and departments, a data lake makes company collaboration ten times easier.

There is a difference between a data lake and a data warehouse, which should be mentioned before we move on. According to AWS’s website, a data warehouse is a database optimized to analyze relational data coming from transactional systems and line-of-business applications. That means data warehouses are not built or optimized to handle unstructured data. So I say, in 2021 with hundreds of data sources, all hail the data lake!
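
For my dev friends, here is a minimal sketch of the “store it raw, make sense of it later” idea behind a data lake. Everything in it is an assumption made up for illustration: the `ToyDataLake` class, the `read_csv_rows` helper, and the keys and sample data. It is an in-memory toy, not how any real data lake product works. The point is only that the lake accepts a CSV, a JSON document, and free text without caring about their shape, and that structure is applied when someone reads the data.

```python
import csv
import io
import json


class ToyDataLake:
    """Toy in-memory lake: every object is kept as raw bytes under a key."""

    def __init__(self):
        self._blobs = {}  # key -> raw bytes, stored exactly as received

    def put(self, key: str, raw: bytes) -> None:
        # No schema check on the way in: structured or not, it is just a blob.
        self._blobs[key] = raw

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def keys(self):
        return sorted(self._blobs)


def read_csv_rows(lake: ToyDataLake, key: str):
    """Interpret a blob as CSV only at the moment someone needs rows."""
    text = lake.get(key).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical sample data: one structured file, one semi-structured, one free text.
lake = ToyDataLake()
lake.put("sales/2021-01.csv", b"order_id,amount\n1,19.99\n2,5.00\n")
lake.put(
    "sensors/reading-001.json",
    json.dumps({"device": "a1", "temp_c": 21.5}).encode("utf-8"),
)
lake.put("support/call-17.txt", b"Customer says the app is slow on login.")

rows = read_csv_rows(lake, "sales/2021-01.csv")
total = sum(float(row["amount"]) for row in rows)
print(lake.keys())
print(f"January sales total: {total:.2f}")
```

A warehouse would typically ask you to define tables and columns before loading any of this. In the sketch, only the sales file ever gets a structure, and only when it is read.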

Geo-Distributed: Data All Over the World

In general, “geo-distributed” is used in reference to data storage, websites, applications, containers, etc. For the sake of this article, we will focus on geo-distributed data storage. A one-sentence description of geo-distributed databases and storage would tell you that it means a database technology deployed across more than one geographical location without performance delays. Geo-distributed data storage can be spread across multiple zones, multiple regions, or even multiple clouds.

Multi-cloud or hybrid cloud is an important architecture in this case because it's a powerful and cost-saving example of geo-distribution. To quote our very own CTO, Kyle: “Hybrid cloud is the peanut butter in your chocolate, it can be an intermixing of public cloud services, but is more typically a blend of private cloud (or on-premise) with public.” When working with hybrid cloud, you are using multiple local "edge" nodes closer to the end user instead of large centralized data centers. So when you implement hybrid cloud, you are using a geo-distributed data storage architecture.

Geo-distributed functionality is great in the sense that with the increased redundancy, you don’t need to worry as much about one data center, cloud instance, or on-premise site going down. A failure in one location isn’t the end of the world for your team, which matters because data is gold, after all. Global performance is improved because queries are distributed across many different servers in parallel (or if you want to be fancy you can call it “interquery and intraquery parallelism”), and users are able to hit a database that is physically closer to them, ultimately reducing latency. The user experience is also better when data storage is distributed because of the rapid query times.
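
The “hit a database that is physically closer” and “one site going down isn’t the end” ideas can be sketched in a few lines. This is a rough illustration under stated assumptions: the site names and coordinates in `REPLICAS` are made up, `pick_replica` is a hypothetical helper, and straight-line distance is used as a stand-in for network latency, which real systems would measure directly.

```python
import math

# Hypothetical replica sites: name -> (latitude, longitude) in degrees.
REPLICAS = {
    "denver": (39.74, -104.99),
    "frankfurt": (50.11, 8.68),
    "singapore": (1.35, 103.82),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def pick_replica(user_location, healthy):
    """Return the closest replica that is currently healthy."""
    candidates = {name: loc for name, loc in REPLICAS.items() if name in healthy}
    if not candidates:
        raise RuntimeError("no healthy replica available")
    return min(candidates, key=lambda name: haversine_km(user_location, candidates[name]))


paris = (48.86, 2.35)

# Normal day: a user in Paris is served by the nearby Frankfurt site.
print(pick_replica(paris, healthy={"denver", "frankfurt", "singapore"}))

# Frankfurt is down: the same user is routed to the next-closest site.
print(pick_replica(paris, healthy={"denver", "singapore"}))
```

The second call is the redundancy payoff: the user gets a slower answer from farther away, but still gets an answer.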

Mix Them Together: Hello Geo-Distributed Data Lake!

So we have all the ingredients, now let’s bake this data cake! From here it’s pretty easy to understand that a geo-distributed data lake is a type of geo-distributed data storage. A data lake distributed across multiple locations… kinda sounds like the data version of an ocean! As a powerful way to collaborate efficiently with large dispersed teams, geo-distributed data lakes make big data analysis easy and user-friendly. Companies spread across the country or even the world can easily access company data and know that the data they are looking at is the most up-to-date version, which comes in handy, especially when dealing with real-time, mission-critical analytics. Here's a nice bulleted list for all my dev friends on the advantages of a distributed data lake (hint: I touched on some of them when talking about geo-distributed data storage):

Collaboration and Synchronization- These two things can become streamlined and done in real-time when you have multiple copies of a database out in the world. Local teams can have the data most important to them running nearby, making it easy to pull the most recent updates.
Data Redundancy and Recovery- Similar to when we talk about general-purpose distributed data storage, you don’t have to worry as much about network, data center, or any other outages or downtime. You have backups in your other replicated data lakes to give you peace of mind.
Performance- Instead of one large data lake, you now have many smaller data lakes spread across your network (also known as load distribution). You aren’t hammering one system all at once.
Agile Development and Data Analytics- A distributed data lake means you can use the same data across different applications, along with the improved collaboration and sync listed above. Your team can work in a smarter and faster fashion.
Scalability- It's easier to scale your data collection when you not only have the flexibility of a data lake that can collect both unstructured and structured data, but you also have access to multiple copies of that data lake spread across different locations. With a distributed architecture, you can easily add additional data lake nodes as demand increases.
Cost Savings- If you are using the hybrid cloud model that we discussed earlier, then you can significantly cut the costs of your data lake. This is because you are not exclusively locked into cloud providers and their cloud hosting costs. Instead, you are using localized edge nodes that are maintained by your organization or a third party.
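
To make the redundancy and scalability points in the list above a bit more concrete, here is a toy sketch of a lake that keeps a full copy on every node. The `LakeNode` and `ReplicatedLake` classes and the node names are hypothetical, invented for this example. It deliberately skips the hard parts that real replication has to solve, such as conflicting writes and catching a node up after an outage.

```python
class LakeNode:
    """One copy of the lake living at one location."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.objects = {}


class ReplicatedLake:
    def __init__(self, node_names):
        self.nodes = {name: LakeNode(name) for name in node_names}

    def put(self, key, value):
        # Write to every node that is reachable right now.
        # (A real system would also resync nodes that were offline.)
        for node in self.nodes.values():
            if node.online:
                node.objects[key] = value

    def get(self, key, prefer):
        # Try the preferred (usually closest) node first, then fall back.
        ordered = [prefer] + [name for name in self.nodes if name != prefer]
        for name in ordered:
            node = self.nodes[name]
            if node.online and key in node.objects:
                return name, node.objects[key]
        raise KeyError(key)

    def add_node(self, name):
        # Scaling out: a new node starts with a copy from any online node.
        new_node = LakeNode(name)
        for node in self.nodes.values():
            if node.online:
                new_node.objects = dict(node.objects)
                break
        self.nodes[name] = new_node


geo_lake = ReplicatedLake(["us-west", "eu-central"])
geo_lake.put("reports/q1.csv", b"region,revenue\nemea,120\n")

# The European site goes down, but the European team can still read the report.
geo_lake.nodes["eu-central"].online = False
served_by, _ = geo_lake.get("reports/q1.csv", prefer="eu-central")
print(served_by)

# Demand grows in Asia, so a new node is added and can serve local reads.
geo_lake.add_node("ap-south")
served_by_new, _ = geo_lake.get("reports/q1.csv", prefer="ap-south")
print(served_by_new)
```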

Use Cases & Tools for Geo-Distributed Data Lakes

With so many benefits to geo-distributed data lakes, use cases and tools that work in the geo-distributed data lake space are not hard to find. Here are some of my favorite ideas for implementing distributed data lakes:

Internet of Things (IoT)- A no-brainer. IoT data can be a pain in the…. and a lot of IoT use cases require real-time or near real-time analytics. When you combine the ability to pull in large amounts of structured and unstructured data with the replication of a geo-distributed architecture, you get the perfect IoT tool. Now your IoT strategy and implementation won’t get bogged down by how to store the massive amounts of data coming in from many devices across many locations.
Extract Transform Load (ETL)- Data lakes are famous for being an excellent route when working in ETL, and this is because you can extract and load your data into your data lake first and transform it whenever you need (an ordering often called ELT, for extract, load, transform). Imagine the power of this strategy being multiplied across different locations.
Enterprise and Big Data- This makes total sense when you consider the benefits above, including agile development, data analytics, scalability, collaboration, and data sync.
Advanced and Real-Time Analytics- When you aren’t worried about the operational side of your data storage, you can focus on the juicy part: collecting data, analyzing it, and putting it to work!
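
The “load it now, transform it whenever you need” idea from the list above, applied to IoT-style readings, might look roughly like this. It is a minimal sketch with made-up names and data: `extract_and_load`, `transform_avg_temp`, and the device messages are all assumptions for illustration, and a plain Python list stands in for the lake’s raw storage.

```python
import json
from statistics import mean


def extract_and_load(raw_zone, incoming_lines):
    """Store device messages untouched, even the ones that look broken."""
    raw_zone.extend(incoming_lines)


def transform_avg_temp(raw_zone):
    """Later, when analysis is needed: parse what is usable and average per device."""
    temps = {}
    for line in raw_zone:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Unparseable lines stay in the lake but are skipped here.
        if isinstance(record, dict) and "device" in record and "temp_c" in record:
            temps.setdefault(record["device"], []).append(record["temp_c"])
    return {device: round(mean(values), 2) for device, values in temps.items()}


# Hypothetical messages arriving from devices in the field.
raw_zone = []
extract_and_load(
    raw_zone,
    [
        '{"device": "a1", "temp_c": 20.0}',
        '{"device": "a1", "temp_c": 22.0}',
        "garbled###",
        '{"device": "b2", "temp_c": 18.5}',
    ],
)

averages = transform_avg_temp(raw_zone)
print(len(raw_zone), "raw messages kept")
print(averages)
```

Nothing is thrown away at load time, so if you later decide the garbled messages matter, they are still sitting in the lake waiting for a different transform.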

With these use cases in mind, let's talk tools. HarperDB works for all of the above use cases because it can ingest both structured and unstructured data. That allows it to act as a data lake repository. THEN add in the advanced clustering and replication capabilities, followed up with SQL capabilities for analytical jobs, and you are not only working with a geo-distributed data lake, but you can also use HarperDB simultaneously as a data warehouse! You can spin up HarperDB on any computing device, from large-scale servers down to micro-computing devices like the Raspberry Pi. Other great tools to check out for geo-distributed data lakes include Snowflake, Cloudera, and Databricks, which I have linked to their Data Lake info pages if you are curious about how they all fit into the landscape. “Distributed” is becoming quite the buzzword for good reason, and it will be exciting to see how that space transforms as more and more teams adopt a distributed architecture.

Combining the flexibility of data lakes with the power of a distributed architecture is a no-brainer in my opinion. Data lakes provide an easy way to ingest all types of data, store large amounts of historical data, and then use only the data that you need when you need it. Geo-distribution enables improved performance, cost savings, scalability, and better safety nets for the ever-growing data needs of modern-day enterprises and startups alike. As always, let me know what I missed and shoot me your ideas for my next “Explained By” blog. 👋