
Improve Video Accessibility with Captions

Gift Egwuenu ・ 5 min read

Have you ever come across a video on a streaming website where the captions are so good that you read them instead of listening to the audio, perhaps because you are in a noisy environment or you just want some quiet? I do this sometimes, but not many streaming services provide good captions. More importantly, people who are deaf or hard of hearing rely on captions to understand what is happening in a video, so captions are what make video content accessible to all users. In this article, I'll show you how to autogenerate captions for any video content using Cloudinary.

For our demo, I have a video of me without captions, and we'll walk through how to generate captions for it. Here's a demo on CodePen:

<video autoplay muted controls width="700">
  <source src="gift.mp4" type="video/mp4">
  <source src="gift.webm" type="video/webm">
  <p>Your browser doesn't support HTML5 video. Here is
     a <a href="gift.mp4">link to the video</a> instead.</p>
</video>

I have the video muted by default, and without captions you can't tell what I am saying until you unmute it. That is only one step away, but it could still turn users away in some cases.

HTML video supports captions through the <track> tag. If you already have a caption file for the video above, you can add it by including the file with the track tag.

<track default kind="captions" srclang="en" src="gift.vtt" />

The default attribute indicates that the captions should be shown by default, kind indicates the purpose of the text (for example, captions or subtitles), srclang indicates the language of the text, and src is the location of the text file.

You can write the captions by hand or have a web service generate them automatically for you. We'll be using Cloudinary add-ons to achieve this.
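For readers who have not opened a caption file before, here is a minimal sketch that builds a small WebVTT document (the format a .vtt file uses) from a list of cues. The helper names formatTimestamp and toWebVtt and the sample cue text are illustrative assumptions, not part of the original demo.

```javascript
// Format a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt).
function formatTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  const pad = (n, width) => String(n).padStart(width, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Build the text of a .vtt file from cues shaped like { start, end, text },
// where start and end are offsets in seconds.
function toWebVtt(cues) {
  const blocks = cues.map(
    (cue) =>
      `${formatTimestamp(cue.start)} --> ${formatTimestamp(cue.end)}\n${cue.text}`
  );
  // The file starts with the WEBVTT header; blocks are separated by blank lines.
  return ['WEBVTT', ...blocks].join('\n\n') + '\n';
}

// Sample cues (made up for illustration).
console.log(
  toWebVtt([
    { start: 0, end: 2.5, text: 'Hi, my name is Gift.' },
    { start: 2.5, end: 5, text: 'Welcome to the demo.' },
  ])
);
```

Saving that output as gift.vtt would give a file that the track tag above can point to.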

Adding Captions with Cloudinary

Cloudinary is a cloud-based service that provides an end-to-end image and video management solution including uploads, storage, manipulations, optimizations, and delivery. An added benefit of using Cloudinary as a media solution is you also get the Add-ons feature which enables you to enhance your images and videos using functionalities offered by Cloudinary's vision and image processing partners.

A list of Cloudinary Addons

We are lucky because two different add-ons from the list can solve our problem: Microsoft Azure Video Indexer and Google AI Video Transcription. Now let's see how to get these services to work.

The first step is to upload the video to Cloudinary using the Media Upload API. Cloudinary allows you to upload media to the cloud and perform transformations through the browser or using server-side code. For this tutorial we'll go with the latter.

Let's use the Cloudinary Node.js SDK to upload the video. To get started, we need to install:

  yarn add cloudinary dotenv

Next, sign up for an account if you don't already have one, or log in to the dashboard to get your account details.

Create a new file index.js and import these packages:

require('dotenv').config(); // loads the variables from .env into process.env
var cloudinary = require('cloudinary').v2;

Create a .env file, set your environment variables with the details from the dashboard, and pass them to the SDK:

cloudinary.config({
  cloud_name: process.env.cloud_name,
  api_key: process.env.api_key,
  api_secret: process.env.api_secret
});
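A missing variable in .env is an easy mistake to make, so here is a small sketch of failing early when one of the three settings is absent. The helper name readCloudinaryConfig is hypothetical; the variable names are the ones used above.

```javascript
// Hypothetical helper: read the three settings used in this article from an
// environment object and throw if any of them is missing or empty.
function readCloudinaryConfig(env) {
  const keys = ['cloud_name', 'api_key', 'api_secret'];
  const missing = keys.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    cloud_name: env.cloud_name,
    api_key: env.api_key,
    api_secret: env.api_secret,
  };
}

// Usage (assumes dotenv has already loaded the .env file):
// cloudinary.config(readCloudinaryConfig(process.env));
```

Taking the environment as a parameter keeps the check easy to try with a plain object instead of real credentials.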

Now to perform the upload action, the Cloudinary upload method sends an authenticated upload API call over HTTPS while sending the video file:

cloudinary.uploader.upload('gift.mp4', {
  resource_type: "video",
  public_id: "gift",
}, function(error, result) {
  console.log(result, error);
});

The code block above will upload the video to the cloud, but before doing that we should enable captions for the video. Including a raw_convert parameter set to the name of an add-on tells Cloudinary to generate captions for the video. I mentioned earlier that we have two add-on options; this is how you can use either of them in your code.

raw_convert: 'azure_video_indexer'
raw_convert: 'google_speech'

cloudinary.uploader.upload('gift.mp4', {
  resource_type: "video",
  public_id: "gift",
  raw_convert: 'azure_video_indexer' // or raw_convert: 'google_speech'
}, function(error, result) {
  console.log(result, error);
});

You can also request the transcription/captions in a different language and, optionally, a region/dialect.
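If you prefer promises to callbacks, the same call can be wrapped as in the sketch below. The function name uploadWithCaptions is hypothetical, and the uploader is passed in as an argument (an assumption made so the sketch can be tried with a stub); with the real SDK you would pass cloudinary.uploader.

```javascript
// Wrap the callback-style upload call shown above in a Promise.
// `uploader` is expected to expose upload(file, options, callback),
// as cloudinary.uploader does in the snippets above.
function uploadWithCaptions(uploader, file, publicId, addOn) {
  return new Promise((resolve, reject) => {
    uploader.upload(
      file,
      { resource_type: 'video', public_id: publicId, raw_convert: addOn },
      (error, result) => (error ? reject(error) : resolve(result))
    );
  });
}

// Usage with the real SDK (not executed here):
// const cloudinary = require('cloudinary').v2;
// uploadWithCaptions(cloudinary.uploader, 'gift.mp4', 'gift', 'azure_video_indexer')
//   .then((result) => console.log(result))
//   .catch((error) => console.error(error));
```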

 raw_convert: "azure_video_indexer:fr-FR"
 raw_convert: "google_speech:de-DE"

A full list of supported languages and region codes is available on Google Cloud speech-to-text language support

Uploading is performed synchronously, and once finished, the uploaded video is immediately available for manipulation and delivery.

Media Dashboard

Cloudinary can deliver the captions in three formats: .srt, .vtt, and .transcript. You can specify the ones you want by appending the formats to the raw_convert parameter.

  raw_convert: "azure_video_indexer:srt:vtt"
  raw_convert: "google_speech:srt:vtt"
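Since the raw_convert value is a colon-separated string, a small helper can assemble it and catch a mistyped add-on name. This is a sketch only: buildRawConvert is a hypothetical name, and it simply joins whatever suffixes you give it. The article shows language codes and formats separately, so combining both in one value is not confirmed here.

```javascript
// Add-on names taken from the article.
const ADD_ONS = ['azure_video_indexer', 'google_speech'];

// Hypothetical helper: join an add-on name with optional suffixes
// (for example a language code, or subtitle formats) using ':'.
function buildRawConvert(addOn, suffixes = []) {
  if (!ADD_ONS.includes(addOn)) {
    throw new Error(`Unknown add-on: ${addOn}`);
  }
  return [addOn, ...suffixes].join(':');
}

console.log(buildRawConvert('google_speech', ['srt', 'vtt'])); // google_speech:srt:vtt
console.log(buildRawConvert('azure_video_indexer', ['fr-FR'])); // azure_video_indexer:fr-FR
```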

Now, let's use the captions generated with our video.

<video crossorigin autoplay muted controls width="100%">
  <source src="https://res.cloudinary.com/lauragift/video/upload/v1582792249/gift.mp4" type="video/mp4">
  <source src="https://res.cloudinary.com/lauragift/video/upload/v1582792249/gift.webm" type="video/webm">
  <track kind="captions" srclang="en" src="https://res.cloudinary.com/lauragift/raw/upload/v1582792283/gift.mp4.en-US.azure.vtt" default>
  <p>Your browser doesn't support HTML5 video. Here is
     a <a href="https://res.cloudinary.com/lauragift/video/upload/v1582792249/gift.mp4">link to the video</a> instead.</p>
</video>

This is a helpful approach in making sure your video content is accessible but always remember that no speech recognition tool is 100% accurate. If exact accuracy is important for your video, you can download the generated .transcript, .srt or .vtt file, edit them manually and overwrite the original files.
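For recurring recognition mistakes, part of that manual editing can be scripted. The sketch below applies simple find-and-replace corrections to the caption text of a downloaded .vtt file while leaving the header and timing lines alone. The helper name correctCaptions and the sample mistake are made up for illustration, and it does not treat cue identifiers or NOTE blocks specially.

```javascript
// Hypothetical helper: apply find-and-replace corrections to the text lines
// of a WebVTT file, leaving the header, timing lines, and blank lines untouched.
function correctCaptions(vttText, corrections) {
  return vttText
    .split('\n')
    .map((line) => {
      const isHeader = line.startsWith('WEBVTT');
      const isTiming = line.includes('-->');
      if (isHeader || isTiming || line.trim() === '') {
        return line;
      }
      let fixed = line;
      for (const [wrong, right] of Object.entries(corrections)) {
        // split/join replaces every occurrence without needing a regex.
        fixed = fixed.split(wrong).join(right);
      }
      return fixed;
    })
    .join('\n');
}

// Sample input with a made-up recognition mistake.
const sampleCaptions =
  'WEBVTT\n\n00:00:00.000 --> 00:00:02.000\nHello from Cloud Ninja\n';
console.log(correctCaptions(sampleCaptions, { 'Cloud Ninja': 'Cloudinary' }));
```

The corrected text can then be saved and uploaded in place of the generated file, as described above.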


Accessibility shouldn't be an afterthought and as we make more effort in delivering more accessible websites to our users we can pay close attention to media on the web. Going the extra mile to make sure that image or video is accessible to everyone will go a long way in creating a friendly and accessible web for everyone.


Cloudinary Upload Demo
Google AI Video Transcription Docs
Microsoft Azure Video Indexer Docs
GitHub Repo

Originally published on my blog


Gift Egwuenu


Frontend engineer based in Lagos Nigeria. I'm passionate about making the web accessible to everyone and also an advocate for building open-source software and developer communities.



Love the idea of generating and embedding transcriptions, thanks Gift!


Awesome! I'm glad you liked it.


Thanks for inspiring me to write this feature request:

Add automatic transcription to video articles on upload #6649

rhymes commented on Mar 16, 2020

Is your feature request related to a problem? Please describe.

Inspired by this DEV post by @lauragift21 about adding automatic captions with Cloudinary I thought it might be nice to consider having something similar for DEV.

The post details how it can be done via Cloudinary, essentially using either of these services: Microsoft Azure Video Indexer or Google AI Video Transcription.

The services create machine generated transcriptions.

As we don't upload directly to Cloudinary (but to S3 and let Cloudinary fetch from there) I'm not sure if we can use those add-ons but maybe they can be invoked passing the URL of the S3 uploaded and transcoded video.

AWS also has its own similar service AWS Transcribe which can be used via API. See also this tutorial: Create an Audio Transcript

Describe the solution you'd like

It should definitely be a per video setting during upload (enabled by default) and there might be privacy considerations by using an external service but this is how I imagine it works:

  • the video gets uploaded to S3
  • Google's, Microsoft's or Amazon's transcription service is invoked (we could test them all to see which one is the most accurate)
  • the transcription is added to the video shown to the users, with a button to display/hide it, like [CC] for YouTube videos

Additional context

Bonus: let the user download, edit and reupload the transcripts to their own videos



Oh wow! Happy to see this as a feature request it'll be a huge improvement to an already amazing platform :)