Building a Big Data Playground Sandbox for Learning

Introduction

As a data engineer, I'm always seeking opportunities to experiment with different data solutions. Whether it's learning a new tool, practicing a solution, or testing ideas in a safe environment, the desire to innovate never ceases. To facilitate this, I've created a personal sandbox using Docker containers, featuring various big data tools. This setup, which I call the "Big-data Ecosystem Sandbox (BES)," leverages open-source big data tools orchestrated within Docker using custom-built images.

Sandbox Components

The BES includes a comprehensive set of tools essential for big data processing and analysis:


Data Storage and Management

  • PostgreSQL: An open-source relational database for structured data storage and complex queries.
  • MinIO: A high-performance, distributed object storage system compatible with Amazon S3 API.
  • Hadoop: An open-source framework for distributed storage and processing of large datasets.

Data Processing and Analytics

  • Hive: A data warehouse infrastructure built on Hadoop for querying and managing large datasets.
  • Spark: A fast, distributed computing system for large-scale data processing.
  • Trino: A distributed SQL query engine for querying data across various sources.

Streaming and Real-time Processing

  • Kafka: A distributed event streaming platform for building real-time data pipelines.
  • Flink: A stream processing framework for real-time data processing and event-driven applications.

Data Orchestration and Management

  • NiFi: An easy to use, powerful, and reliable system to process and distribute data.
  • Airflow: A platform to programmatically author, schedule, and monitor workflows.

Getting Started

You can find the GitHub repo at the following link:

https://github.com/amhhaggag/bigdata-ecosystem-sandbox

Set Up ALL the Sandbox Tools

“Make sure that you have enough CPU and RAM”

To set up all the sandbox tools, run the following script:

git clone https://github.com/amhhaggag/bigdata-ecosystem-sandbox.git
cd bigdata-ecosystem-sandbox

./bes-setup.sh

This script will do the following:

  1. Pull the necessary Docker images from Docker Hub
    • amhhaggag/hadoop-base-3.1.1
    • amhhaggag/hive-base-3.1.2
    • amhhaggag/spark-3.5.1
  2. Prepare the PostgreSQL database for the Hive Metastore service
  3. Copy the Trino configuration to its mounted volume (a local directory)
  4. Create and start all the containers (a quick verification step follows below)
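
Once the script finishes, it's worth confirming that everything came up cleanly. A minimal check with plain Docker commands (replace "postgres" with whichever container you want to inspect):

# List the sandbox containers with their status and exposed ports
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# If a service looks unhealthy, tail its logs
docker logs --tail 50 postgres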

Now, let's take a look at what is included in this repository.

Sandbox Architecture

The BES uses a combination of official Docker images and custom-built images to ensure compatibility and integration between tools. The custom images include Apache Hadoop, Hive, Spark, Airflow, and Trino, built in a hierarchical manner to maintain dependencies and ensure smooth integration.

Below is a diagram illustrating the dependencies between the custom-built images.

Custom Images Diagram
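
If you want to see how these images relate on your machine, you can list them and inspect their layers. This is a small sketch, assuming the images listed earlier have already been pulled by the setup script:

# List the custom sandbox images pulled from Docker Hub
docker images --filter=reference='amhhaggag/*'

# Show the layer history of one of them, e.g. the Spark image
# (append the tag shown by `docker images` if it is not "latest")
docker history amhhaggag/spark-3.5.1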

Docker Compose Overview

To use the sandbox efficiently, you need at least a basic knowledge of Docker and Docker Compose. Here is a quick overview of Docker Compose.

A Docker Compose file, typically named docker-compose.yml, is a YAML file that defines, configures, and runs multi-container Docker applications. It allows you to manage all your application's services, networks, and volumes in a single place, streamlining deployment and scaling processes.

Here's the general structure of a Docker Compose file:

services:
  service_name:
    image: image_name:tag  # Use an existing image
    build:
      context: ./path  # Path to the build context
      dockerfile: Dockerfile  # Dockerfile to use for building the image
    ports:
      - "host_port:container_port"  # Map host ports to container ports
    environment:
      - VARIABLE=value  # Set environment variables
    volumes:
      - host_path:container_path  # Mount host paths or volumes
    networks:
      - network_name  # Connect to specified networks
    depends_on:
      - other_service  # Specify service dependencies

networks:
  network_name:
    driver: bridge  # Specify the network driver

volumes:
  volume_name:
    driver: local  # Specify the volume driver

Key Components Explained:

  • services: Defines individual services (containers) that make up your application.
    • service_name: A unique identifier for each service.
      • image: Specifies the Docker image to deploy.
      • build: Instructions for building a Docker image from a Dockerfile.
      • ports: Exposes container ports to the host machine.
      • environment: Sets environment variables within the container.
      • volumes: Mounts host directories or named volumes into the container.
      • networks: Connects the service to one or more networks.
      • depends_on: Specifies service dependencies to control startup order.
  • networks: (Optional) Defines custom networks for your services to communicate.
    • network_name: The name of the network.
      • driver: The network driver to use (e.g., bridge, overlay).
  • volumes: (Optional) Defines named volumes for persistent data storage.
    • volume_name: The name of the volume.
      • driver: The volume driver to use.
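
Before moving on to a concrete example, note that Docker Compose can validate a compose file and print the fully resolved configuration, which is a quick way to catch indentation or syntax mistakes:

# Validate docker-compose.yml in the current directory and
# print the resolved configuration
docker-compose config

# Once the file is valid, start the defined services in the background
docker-compose up -d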

Practical Example

Below is an example of a Docker Compose file for the PostgreSQL service (a quick way to test it follows the explanation):

services:
  postgres:
    image: postgres:14
    container_name: postgres
    volumes:
      - ./mnt/postgres:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: "admin"
      POSTGRES_USER: "admin"
      POSTGRES_PASSWORD: "admin"
    ports:
      - "5432:5432"

networks:
  default:
    name: bes-network


Explanation of the Example:

  • Services
    • Service name: postgres
    • image: the Docker image the container is created from (postgres:14)
    • container_name: the container will be created with the name “postgres”
    • volumes: the local directory “./mnt/postgres” is mounted to the container directory “/var/lib/postgresql/data” so that the database files persist even if the container is removed and recreated
    • environment: the environment variables passed to the container (here, the default database, user, and password)
    • ports: host port 5432 (on the left) is mapped to container port 5432 (on the right)
  • Networks
    • the default network is named “bes-network”, so all containers attached to it can communicate with one another
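
With this file in place, a quick way to try the service is to start it on its own and open a psql session inside the container. The commands below are a minimal sketch, assuming the container name and credentials from the example above:

# Start only the PostgreSQL service in the background
docker-compose up -d postgres

# Open an interactive psql session using the user and database
# defined in the environment section above
docker exec -it postgres psql -U admin -d admin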

Basic Docker Commands

Here are some fundamental Docker commands to help you interact with containers:

  • docker ps: List running containers. Example: docker ps
  • docker-compose up: Create and start the containers defined in docker-compose.yml. Example: docker-compose up -d
  • docker start: Start a stopped container. Example: docker start my_container
  • docker exec: Execute a command in a running container. Example: docker exec -it my_container bash
  • docker logs: View the logs of a container. Example: docker logs my_container
  • docker cp: Copy files/folders between a container and the local filesystem. Example: docker cp my_container:/path/to/file.txt /local/path/
  • docker stop: Stop a running container. Example: docker stop my_container
  • docker rm: Remove a container. Example: docker rm my_container
  • docker-compose down: Stop and remove the containers and networks defined in docker-compose.yml (add -v to also remove volumes). Example: docker-compose down

These commands will help you manage your Docker containers effectively in the Big-data Ecosystem Sandbox.

Practical Applications

The BES opens up a world of possibilities for data engineering experiments and learning. Some potential use cases include the following (starter commands for a couple of them are sketched after the list):

  • Setting up a data lake using MinIO and processing it with Spark
  • Creating real-time data pipelines with Kafka and Flink
  • Orchestrating complex data workflows using Airflow
  • Performing distributed SQL queries across multiple data sources with Trino
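
To get hands-on quickly, you can open the interactive shells that ship with these tools directly inside their containers. The commands below are only a sketch: the container names (spark-master, trino) are assumptions and may differ from the names used in the repository's compose files, so check docker ps first:

# Open a Spark shell inside the Spark container
# (the container name "spark-master" is an assumption; verify it with `docker ps`)
docker exec -it spark-master spark-shell

# Open the Trino CLI inside the Trino container
# (the container name "trino" is an assumption; the CLI is on the PATH in the official image)
docker exec -it trino trino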

Conclusion

The Big-data Ecosystem Sandbox provides a comprehensive environment for learning and experimenting with various big data tools. By leveraging Docker and custom integrations, it offers a flexible and powerful platform for data engineers to enhance their skills and explore new ideas.

In future posts, we'll dive deeper into specific use cases and advanced configurations to help you get the most out of your BES. Stay tuned, and happy data engineering!
