
Priyansh Jain

Hands-on with distributed file systems and storage virtualization!

We had a course this spring, Virtualization, which had a project component.
You've probably heard about virtualization: things look unified from one end, but are actually an orchestration of multiple services at the other. There are multiple types of virtualization, ranging from virtualizing entire operating systems (Oracle VirtualBox, for example) to virtualizing desktops. I had to select a topic for the project, and after some research I found storage virtualization interesting.

Here's the source code for the project.

Storage virtualization is, practically speaking, providing a unified view over a cluster of storage devices, managed in such a way that we get data safety and size/performance optimization.

A common issue with any storage solution is duplicate data: unnecessary copies of files that eat up your resources. Removing them is called data de-duplication, and that's what I set out to tackle with my project (a distributed file system), while getting hands-on experience with managing multiple servers and building a management system on top of them.

The stack

The web framework I'm most comfortable with is Express (Node.js), so I chose it, coupled with EJS templating for the frontend views.
I also decided to use Docker for containerized services and Docker Swarm to orchestrate them.

Development Methodology

I wrote some general upload code in Express, using multer to handle the file uploads, then packaged it into a Docker image. The base image was the LTS Node.js one, and I used docker-compose to keep everything organized.
Docker Swarm manages the cluster and provides fault tolerance through replicas; what I mainly wanted from it here was easy hostname mapping for my storage servers.
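Roughly, the upload handling boiled down to something like this (a sketch, not the exact code from the repo; the /upload route and the "fileItem" field name are assumptions, while /tmp/uploads and port 3000 come from the compose file further down):

// server.js - rough sketch of the Express + multer upload handling
const express = require("express");
const multer = require("multer");

const app = express();
const upload = multer({ dest: "/tmp/uploads" }); // multer writes incoming files here

app.set("view engine", "ejs"); // EJS templates for the frontend views

// Accept a single file per request; multer exposes its metadata on req.file.
app.post("/upload", upload.single("fileItem"), (req, res) => {
  console.log("stored", req.file.path, req.file.originalname);
  res.redirect("/");
});

app.listen(3000);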
The Dockerfile I used for building the image:

FROM node:10

# Create app directory
WORKDIR /usr/src/app

# Install app dependencies
# A wildcard is used to ensure both package.json AND package-lock.json are copied
# where available (npm@5+)
COPY package*.json ./

RUN npm install
# If you are building your code for production
# RUN npm install --only=production

# Bundle app source
COPY . .
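
# No CMD here: during development the container command (nodemon) is
# supplied by docker-compose, via the command field shown below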

Note that the working directory is set to /usr/src/app. During development, I mounted the current folder as a volume, so nodemon would automatically detect changes and reflect them immediately. The development docker-compose file:

version: "3.2"

networks:
  testnet:
    external:
      name: testnet

services:
  server1_server:
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
      # placement:
      #   constraints:
      #     - node.hostname == your-hostname-2-here
    image: presto412/storev1
    hostname: server1.example.com
    command: ./node_modules/.bin/nodemon
    volumes:
      - "/tmp/uploads:/tmp/uploads"
      - ".:/usr/src/app/"
    environment:
      - SELF_HOSTNAME=server1.example.com
    ports:
      - "3000:3000"
    networks:
      testnet:
        aliases:
          - server1.example.com

First attempt

Since I had developed on blockchain platforms during the past year, my initial go-to was not blockchain itself, but the technology that lies underneath it: a distributed ledger.

The general process flow -

  • The file is uploaded via a form
  • On upload, the backend hashes the file contents by making use of the sha1 library. I also considered using xxHash but didn't implement it because speed wasn't a concern for a small indie project.
  • It maintains a mapping JSON of the following structure
{
  "0347f8b35b22104339b5f9d9d1c8d3f0251b6bdc": {
    "details": {
      "name": "JKSUCI_AuthorAgreement (2).pdf",
      "mimetype": "application/pdf",
      "date": "Sun, 07 Apr 2019 23:55:04 GMT",
      "size": 27087
    },
    "backups": [
      {
        "hostname": "bangalore.storage.com",
        "path": "/tmp/uploads/528518c7e306e2f04e34117a5b245e6f"
      },
      {
        "hostname": "amsterdam.storage.com",
        "path": "/tmp/uploads/600405f52831535e54aaf590f2b14052"
      }
    ]
  }
}
  • So here you can see that the key is the hash of the file, and the value is where it is stored and some metadata.
  • A new file's hash is then checked against this JSON; if it already exists, the system denies the upload and doesn't save the file.
  • Otherwise, it randomly selects two servers to store the file on and updates the mapping, which is then propagated to every other server, making this a distributed ledger implementation (a rough sketch of this flow follows the list).
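
Condensed into code, that check-and-store flow looks roughly like the sketch below. I'm using Node's built-in crypto module here instead of the sha1 package, and the function and variable names are purely illustrative:

// Rough sketch of the first-attempt dedup logic; `ledger` mirrors the mapping
// JSON shown above.
const crypto = require("crypto");
const fs = require("fs");

const STORAGE_SERVERS = [
  "bangalore.storage.com",
  "amsterdam.storage.com",
  "server1.example.com"
];

// SHA-1 of the raw file contents (the project itself uses the sha1 package).
function sha1Of(buffer) {
  return crypto.createHash("sha1").update(buffer).digest("hex");
}

// Pick two distinct servers at random to hold copies of the file.
function pickTwoServers(servers) {
  return [...servers].sort(() => Math.random() - 0.5).slice(0, 2);
}

function handleUpload(ledger, file) {
  const hash = sha1Of(fs.readFileSync(file.path));
  if (ledger[hash]) {
    // Duplicate: deny the upload and don't store another copy.
    return { accepted: false, reason: "File already exists" };
  }
  ledger[hash] = {
    details: {
      name: file.originalname,
      mimetype: file.mimetype,
      date: new Date().toUTCString(),
      size: file.size
    },
    // In the real mapping, each backup records its own path on that server.
    backups: pickTwoServers(STORAGE_SERVERS).map(hostname => ({
      hostname,
      path: file.path
    }))
  };
  // The updated ledger is then pushed to every other server, which is what
  // makes this a (very small) distributed ledger.
  return { accepted: true, hash };
}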

At this point the project was working, and I showed the implementation at the first review. My professor suggested adding some meaningful changes, and the one I liked most was serving files based on geographical proximity to the requesting user.

Improvements

I moved to the cloud this time and created 5 droplets on DigitalOcean, each in a different data centre: New York, Toronto, Singapore, Bangalore, and Amsterdam.
I also changed the architecture a bit: a central server now serves the frontend and maintains metadata for each stored file (I know it's a bottleneck, but it was enough for a class project).

(Architecture diagram: the central server in front of the storage servers across the five data centres)

I also hashed the files on the client side this time, using the FileReader API together with the Forge library. Here's the code for it.

  <script type="text/javascript">
    function submitFile() {
      var form = document.getElementById("uploadFileForm");
      var reader = new FileReader();
      reader.onload = function() {
        var fileContent = reader.result;
        var md = forge.md.sha1.create();
        md.update(fileContent);
        var fileContentHash = md.digest().toHex();
        document.getElementById("fileHash").value = fileContentHash;
        $.ajax({
          type: "POST",
          url: "/checkHash",
          contentType: "application/json;charset=utf-8",
          data: JSON.stringify({
            fileHash: fileContentHash
          }),
          success: function(response) {
            console.log(response);
            console.log("sresponse", response.success);
            if (!response.success) {
              form.submit();
            } else {
              console.log("not submtting form");
              alert("File already exists");
            }
          },
          error: function(response) {
            console.log("fresponse", response);
          }
        });
      };
      const file = document.getElementById("fileItem").files[0];
      reader.readAsText(file);
    }
  </script>

So now only the file hash was sent to the server via AJAX, and the file itself was uploaded only if the hash wasn't already present.

When the central server received the file, it selected two storage servers to store it on and then updated its metadata file.
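
Roughly, the central server's side of that flow looks like the sketch below. The /store endpoint on the storage servers and the helper names are assumptions for illustration, not the repo's actual API:

// Central server sketch: the /checkHash route the client script posts to,
// plus pushing a fresh upload out to the chosen storage servers.
const express = require("express");
const fs = require("fs");
const http = require("http");

const app = express();
const metadata = {}; // the hash -> { details, backups } map shown earlier

// The client sends { fileHash }; success=true means the file already exists.
app.post("/checkHash", express.json(), (req, res) => {
  res.json({ success: Boolean(metadata[req.body.fileHash]) });
});

// Stream a stored file to one storage server's (assumed) /store endpoint.
function pushToServer(hostname, filePath, hash) {
  return new Promise((resolve, reject) => {
    const request = http.request(
      { host: hostname, port: 3000, path: "/store/" + hash, method: "POST" },
      res => { res.resume(); resolve(hostname); }
    );
    request.on("error", reject);
    fs.createReadStream(filePath).pipe(request);
  });
}

// chosenServers is the randomly picked pair; metadata[hash].details is assumed
// to have been filled in when the upload arrived.
async function storeWithBackups(hash, filePath, chosenServers) {
  await Promise.all(chosenServers.map(h => pushToServer(h, filePath, hash)));
  metadata[hash].backups = chosenServers.map(hostname => ({ hostname, path: filePath }));
  fs.writeFileSync("metadata.json", JSON.stringify(metadata, null, 2));
}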

So this time, when a request for a file comes in, I used APIs like ipstack and distance24 to determine which city the requesting IP is in and how far that city is from each of the storage servers. The servers are then sorted by distance, a URL is generated for the closest one, and that URL is returned. When the user clicks the link, the file is downloaded.
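
In sketch form, the server selection looks like this. lookupCity and cityDistance stand in for the ipstack and distance24 calls (their exact request formats aren't shown), the hostnames in the map are made up, and only the cities match the actual droplets:

// Pick the closest of the servers that actually hold the file.
const SERVER_CITY = {
  "newyork.storage.com": "New York",
  "toronto.storage.com": "Toronto",
  "singapore.storage.com": "Singapore",
  "bangalore.storage.com": "Bangalore",
  "amsterdam.storage.com": "Amsterdam"
};

// candidates = hostnames from the file's backups list in the metadata.
async function closestServer(clientIp, candidates, lookupCity, cityDistance) {
  const clientCity = await lookupCity(clientIp); // ipstack: IP -> city
  const ranked = await Promise.all(
    candidates.map(async hostname => ({
      hostname,
      km: await cityDistance(clientCity, SERVER_CITY[hostname]) // distance24
    }))
  );
  ranked.sort((a, b) => a.km - b.km); // nearest first
  return ranked[0].hostname; // the download URL is built from this hostname
}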

Docker and Docker Swarm have been extremely helpful in this scenario, since I can simply specify the deployment constraints, pull and distribute images and so on.

What was your first project with distributed systems? Let me know in the comments below. Here's the source for the project.

Thanks for reading!

Priyansh Jain
Github | LinkedIn
