Mohamed Mayallo

Originally published at mayallo.com

MongoDB GridFS, Made Simple

Introduction

When it comes to choosing a file-upload strategy, there are plenty of options. One of them is storing files as binary data inside the database, and MongoDB GridFS follows this pattern. GridFS is a file-system abstraction on top of MongoDB: the uploaded file is split into chunks during upload and reassembled during retrieval.

How GridFS Works

Let's walk through how GridFS works in simple steps:

  • During the first file upload, a new bucket named fs (unless you specify a different name) is created if it does not already exist. The bucket consists of two collections: fs.chunks and fs.files.
  • A new index is created (if it does not already exist) on both collections for fast retrieval.
  • The uploaded file is divided into chunks (255KB per chunk by default, unless you specify a different chunk size) and stored in the fs.chunks collection. To keep the chunks in order, each chunk document has a field n holding its position.
  • A metadata document is created for the uploaded file in the fs.files collection, containing its length, chunkSize, uploadDate, filename, and contentType.
  • During retrieval, GridFS reads the file's metadata from the fs.files collection, uses it to reassemble the chunks from the fs.chunks collection, and returns the file to the client as a stream or in memory. The sketch below illustrates what the documents in both collections look like.
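For illustration only, here is roughly what the documents in the two collections might look like after uploading a single file (the values are placeholders; the exact shape can vary by driver version):

// A document in fs.files (metadata describing the whole file)
{
  _id: ObjectId("..."),
  length: 1048576,            // total file size in bytes
  chunkSize: 261120,          // 255KB, the default chunk size
  uploadDate: ISODate("..."),
  filename: "report.pdf",
  contentType: "application/pdf"
}

// One of the documents in fs.chunks (a single piece of the file)
{
  _id: ObjectId("..."),
  files_id: ObjectId("..."),  // points back to the _id of the fs.files document above
  n: 0,                       // the chunk's position within the file
  data: BinData(0, "...")     // the raw binary payload of this chunk
}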

When to Use GridFS over Ordinary Filesystem Storage

GridFS is a good fit if you have requirements like these:

  • If your files exceed 16MB, the BSON document size limit in MongoDB.
  • If you frequently need to access or update specific portions of a file without loading the entire file into memory (see the sketch right after this list).
  • If your file system limits the number of files per directory, GridFS lets you store as many files as you need.
  • If you want to track metadata about your files, which GridFS provides as a built-in feature.
  • Because your files live in the database, they automatically benefit from MongoDB's built-in replication, backup, and sharding instead of you handling these manually on the file system.
  • Deleting a file in GridFS is as easy as deleting a document in the database, whereas on a file system deletion tends to involve more housekeeping.
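As a quick sketch of the partial-access point above: the Node.js driver's openDownloadStream accepts start and end byte offsets, so you can stream only a slice of a stored file. The route name below is illustrative, and bucket is assumed to be a GridFSBucket set up as in the hands-on example later in this post:

// Illustrative route: stream only the first megabyte of a stored file
app.get('/file/:id/preview', (req, res) => {
  const _id = new mongoose.Types.ObjectId(req.params.id);
  bucket
    .openDownloadStream(_id, { start: 0, end: 1024 * 1024 }) // byte offsets, not the whole file
    .on('error', () => res.status(404).json({ err: 'Not a File!' }))
    .pipe(res);
});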

GridFS Limitations

There is no one-size-fits-all solution, so bear in mind these limitations:

  • Continuously serving large files from the database as many chunks can put pressure on your working set (a 16MB file comes back as roughly 65 chunks of 255KB each), especially if you deal with gigabytes or terabytes of data.
  • Serving a file from the database is somewhat slower than serving it directly from the file system.
  • GridFS doesn't natively provide a way to update an entire file atomically. If your system frequently rewrites whole files, either avoid GridFS or use a workaround as discussed below.

How to Mitigate GridFS Limitations

Here are some best practices that mitigate these limitations when working with GridFS:

  • To reduce pressure on the working set, you can serve your files from a separate MongoDB server dedicated to GridFS storage.
  • Also for the working set, you can increase the chunk size beyond the default 255KB.
  • Regarding atomic updates: if your system frequently rewrites entire files, or many users access the files concurrently, use a versioning approach to track file updates. Depending on your needs, you can serve only the latest version of a file and either delete the older versions or keep them as the file's history (a sketch follows this list).
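As a sketch of that versioning approach: if you keep uploading new revisions under the same filename, the Node.js driver's openDownloadStreamByName can pick a revision for you, where revision: -1 means the most recent one. The route name is illustrative, and bucket is assumed to be set up as in the hands-on example below:

// Illustrative route: always serve the newest revision of a file stored under a given name
app.get('/file/by-name/:filename/latest', (req, res) => {
  bucket
    .openDownloadStreamByName(req.params.filename, { revision: -1 }) // -1 = latest revision
    .on('error', () => res.status(404).json({ err: 'Not a File!' }))
    .pipe(res);
});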

Hands-on example using Node.js

In this example, we will see how to upload, download, and list files in a bucket using GridFS.

I assume you are familiar with Node.js.

First of all, let's create (if it does not exist) or retrieve our bucket:

const mongoose = require('mongoose');

let bucket;
const connection = mongoose.createConnection('mongodb://localhost:27017/gridfs'); // `gridfs` is the database name, you can name it as you want
// Once the database connection is open, create (if it does not exist) or retrieve our bucket reference
connection.once('open', () => {
  bucket = new mongoose.mongo.GridFSBucket(connection.db, {
    bucketName: 'uploads', // Override the default bucket name (fs)
    chunkSizeBytes: 1048576 // Override the default chunk size (255KB)
  });
});


Let's upload a file using GridFS (this example uses multer together with the multer-gridfs-storage package):

const multer = require('multer');
const { GridFsStorage } = require('multer-gridfs-storage'); // Named export in recent versions of multer-gridfs-storage

// With the first upload, the `uploads` bucket will be created if it does not exist
const storage = new GridFsStorage({
  db: connection,
  file: (req, file) => ({
    filename: `${file.originalname}_${Date.now()}`, // Override the default filename
    bucketName: 'uploads', // Override the default bucket name (fs)
    chunkSize: 500000, // Override the default chunk size (255KB)
    metadata: { uploadedBy: 'Someone', downloadCount: 4 } // Attach any metadata to the uploaded file
  })
});
const upload = multer({ storage }); // Use GridFS as the multer storage engine

// Use multer as middleware to upload the file (`app` is an Express application created elsewhere)
app.post('/upload', upload.single('file'), (req, res) => {
  res.json(req.file);
});


Bear in mind that you could rely on this upload code alone to create the bucket during the first upload, instead of the first step. However, the first step guarantees that the bucket is created right after the database connection opens and gives us a bucket reference to use in the following snippets.

Let's list our files' metadata:

app.get('/metadata', async (req, res) => {
  try {
    // The find() method returns a cursor that manages the results of your query
    const cursor = bucket.find({});
    // Retrieve the data as array
    const filesMetadata = await cursor.toArray();
    res.json(filesMetadata);
  } catch (err) {
    res.json({ err: `Error: ${err.message}` });
  }
});


The find method returns a FindCursor, which you can iterate through to get your results. The toArray method consumes the cursor and resolves to an array of documents.

To retrieve a specific file's metadata:

app.get('/metadata/:id', async (req, res) => {
  try {
    const _id = new mongoose.Types.ObjectId(req.params.id);
    const cursor = bucket.find({ _id });
    const filesMetadata = await cursor.toArray();
    res.json(filesMetadata[0] || null);
  } catch (err) {
    res.json({ err: `Error: ${err.message}` });
  }
});


Finally, let's download a file:

app.get('/file/:id', async (req, res) => {
  try {
    const _id = new mongoose.Types.ObjectId(req.params.id);
    // Fetching the file metadata first is just a guard to avoid a FileNotFound error
    const cursor = bucket.find({ _id });
    const filesMetadata = await cursor.toArray();
    if (!filesMetadata.length) return res.json({ err: 'Not a File!' });
    // You can simply stream a file like this with its id
    bucket.openDownloadStream(_id).pipe(res);
  } catch (err) {
    res.json({ err: `Error: ${err.message}` });
  }
});

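And since an earlier section claimed that deleting a file in GridFS is as easy as deleting a document, here is a minimal sketch of a delete route (the route name is illustrative); the driver's bucket.delete removes the fs.files document together with all of its chunks:

// Illustrative route: delete a file and all of its chunks by id
app.delete('/file/:id', async (req, res) => {
  try {
    const _id = new mongoose.Types.ObjectId(req.params.id);
    await bucket.delete(_id); // Removes the metadata document and every associated chunk
    res.json({ deleted: req.params.id });
  } catch (err) {
    res.json({ err: `Error: ${err.message}` });
  }
});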

That's it! You can find this code in this repo.

Conclusion

At the end of the day, as we have seen, there is no one-size-fits-all solution. Choosing GridFS as your storage option is your decision, and it depends on your needs and your understanding of the pros and cons of the available options.

References

MongoDB Documentation

When to use GridFS on MongoDB?

GridFS & MongoDB: Pros & Cons
