Building a TypeScript Video Editor as a Solo Dev

Four years after embarking on an exciting SaaS-building journey, it's the right time to rebuild one of the key components of our app: a simple video editor for social media videos, written in JavaScript.

Here's the stack I decided to use for this rewrite, which is now a work in progress.

Svelte 5

Since our frontend is written in SvelteKit, this is the best option for our use case.

The video editor is a separate private npm library I can simply add to our frontend. It's a headless library, so the video editor UI is completely isolated.

The video editor library is responsible for syncing the video and audio elements with the timeline, rendering animations and transitions, rendering HTML texts into canvas, and much more.

SceneBuilderFactory takes in a scene JSON object as an argument to create a scene. StateManager.svelte.ts then keeps the current state of the video editor in real time.

This is super useful for drawing and updating the playhead position in the timeline, and much more.
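To give a rough idea of the shape of this, here is a minimal sketch using Svelte 5 runes (illustrative only; the real class holds far more state and responsibilities):

// StateManager.svelte.ts - illustrative sketch, not the real implementation.
export type EditorState = 'loading' | 'paused' | 'playing';

export class StateManager {
  state = $state<EditorState>('paused');
  currentTime = $state(0); // current playhead position in seconds

  setTime(seconds: number) {
    this.currentTime = seconds; // the timeline UI reacts to this automatically
  }
}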

Pixi.js

Pixi.js is an outstanding JavaScript canvas library.

Initially, I started building this project with Pixi v8, but for reasons I'll mention later in this article, I decided to go with Pixi v7.

However, the video editor library is not tightly coupled to any dependencies, so it's easy to replace them if needed or to test different tools.

GSAP

For timeline management and complex animations, I decided to use GSAP.

There's no other tool in the JavaScript ecosystem I'm aware of that allows building nested timelines, combined animations, or complex text animations in such a simple way.

I have a GSAP business license, so I can also leverage its additional tools to simplify things further.
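For instance, nested timelines look roughly like this (a minimal sketch; the selectors and timings are made up):

import { gsap } from 'gsap';

// Each scene gets its own timeline, which is nested into a master timeline.
const master = gsap.timeline({ paused: true });

const intro = gsap.timeline()
  .from('.headline', { y: 40, opacity: 0, duration: 0.6 })
  .to('.headline', { scale: 1.1, duration: 0.3 }, '+=0.2'); // starts 0.2s after the previous tween ends

master.add(intro, 0); // intro starts at the 0-second mark of the master timeline
master.seek(1.25);    // scrub the whole composition to 1.25 seconds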

Key Challenges

Before we dive into the stuff I use on the backend, let's look at some challenges you need to solve while building a video editor in JavaScript.

Synchronize video/audio with the timeline

This question is often asked in the GSAP forum.

It doesn't matter whether you use GSAP for timeline management or not; either way, you need to do a couple of things.

On each render tick:

Get the video's time relative to the timeline. Let's say your video starts playing from the beginning at the 10-second mark of the timeline.

Well, before 10 seconds you actually don't care about the video element, but as soon as it enters the timeline, you need to keep it in sync.

You can do this by computing the video's relative time from the video element's currentTime, comparing it against the current scene time, and checking whether the difference stays within an acceptable "lag" window.

If the lag is larger than, let's say, 0.3 seconds, you need to auto-seek the video element to bring it back in sync with the main timeline. The same applies to audio elements.
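Here is roughly what that per-tick check could look like (a sketch with my own naming, not the actual library code):

const MAX_LAG = 0.3; // seconds of drift we tolerate before force-seeking

// Called on every render tick. `timelineTime` is the current time of the main
// timeline; `startAt` is the point where this video enters the timeline.
function syncVideo(video: HTMLVideoElement, timelineTime: number, startAt: number) {
  if (timelineTime < startAt) return; // the video hasn't entered the timeline yet

  const expectedTime = timelineTime - startAt; // where the video should currently be
  const lag = Math.abs(video.currentTime - expectedTime);

  if (lag > MAX_LAG) {
    // Drifted too far: hard-seek the element back in sync with the main timeline.
    video.currentTime = expectedTime;
  }
}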

Other things you need to consider:

  • handle play / pause / ended states
  • handle seeking

Play and pause are simple to implement. For seeking, I add the id of the seeking video component into our Svelte StateManager, which automatically changes the state to "loading".

StateManager has an EventManager dependency, and on each state change it automatically triggers a "changestate" event, so we can listen to these events without using $effect.

The same thing happens after seeking is finished and the video is ready to play.

This way we can show a loading indicator instead of the play/pause button in our UI whenever some of the components are loading.
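In practice, listening for those events can be as simple as this ("changestate" is the real event name; the EventManager shape and element ids here are illustrative):

// The EventManager shape is assumed; only the 'changestate' event name comes from above.
declare const stateManager: {
  events: { on(event: 'changestate', cb: (state: 'loading' | 'paused' | 'playing') => void): void };
};

const playButton = document.querySelector<HTMLButtonElement>('#play')!;
const spinner = document.querySelector<HTMLElement>('#spinner')!;

stateManager.events.on('changestate', (state) => {
  // Show a spinner instead of the play/pause button while any component
  // (for example a seeking video) is still loading.
  playButton.hidden = state === 'loading';
  spinner.hidden = state !== 'loading';
});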

Text rendering is not as simple as you think

CSS, GSAP, and GSAP's TextSplitter allow me to do really amazing stuff with text elements.

Native canvas text elements are limited, and since the primary use case of our app is to create short-form videos for social media, they are not a good fit.

Luckily, I found a way to render almost any HTML text into canvas, which is crucial for rendering the video output.

Pixi HTMLText

This would have been the simplest solution; unfortunately, it did not work for me.

When I was animating HTML text with GSAP, it was lagging significantly, and it also did not support many Google fonts I tried with it.

Satori

Satori is amazing, and I can imagine it being used in some simpler use cases. Unfortunately, some GSAP animations change styles that are not compatible with Satori, which results in an error.

SVG with foreign object

Finally, I made a custom solution to solve this.

The tricky part was supporting emojis and custom fonts, but I managed to solve that.

I created an SVGGenerator class that has a generateSVG method, which produces an SVG like this:

<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}" viewBox="0 0 ${width} ${height}" version="1.1">${styleTag}<foreignObject width="100%" height="100%"><div xmlns="http://www.w3.org/1999/xhtml" style="transform-origin: 0 0;">${html}</div></foreignObject></svg>

The styleTag then looks like this:

<style>@font-face { font-family: ${fontFamilyName}; src: url('${fontData}') }</style>

For this to work, the HTML we pass in needs to have the correct font-family set in its inline style. The font data needs to be a base64-encoded data URL, something like data:font/ttf;base64,longboringstring
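To get that SVG onto the canvas, it has to be rasterized through an image first. A minimal sketch of that step (assuming all fonts are already inlined as base64, so the browser needs no external requests; this is not the actual SVGGenerator code):

// Turn the generated SVG string into an image a canvas (or Pixi texture) can draw.
async function svgToImage(svg: string): Promise<HTMLImageElement> {
  const blob = new Blob([svg], { type: 'image/svg+xml;charset=utf-8' });
  const url = URL.createObjectURL(blob);
  const img = new Image();

  try {
    await new Promise<void>((resolve, reject) => {
      img.onload = () => resolve();
      img.onerror = () => reject(new Error('Failed to load SVG'));
      img.src = url;
    });
  } finally {
    URL.revokeObjectURL(url); // the decoded image no longer needs the blob URL
  }

  return img; // can now be passed to ctx.drawImage() or used for a texture
}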

Component Lifecycle

Composition over inheritance, they say.

As an exercise to get my hands dirty, I refactored from an inheritance-based approach to a hook-based system.

In my video editor, I call elements like VIDEO, AUDIO, TEXT, SUBTITLES, IMAGE, SHAPE, etc. components.

Before rewriting this, there was an abstract class BaseComponent, and each component class was extending it, so VideoComponent had logic for videos, etc.

The problem was that it became a mess pretty quickly.

Components were responsible for how they are rendered, how they manage their Pixi texture, how they are animated, and more.

Now, there is only one component class, which is very simple.

This now has four lifecycle events:

- setup
- update    // called on each render tick, video rewind, frame export...
- refresh   // called when user changes component data in UI
- destroy

Component Lifecycle Code Sample

This component class has a method called addHook that changes its behavior.

Hooks can hook into component lifecycle events and perform actions.

For example, there is a MediaHook that I use for video and audio components.

MediaHook creates the underlying audio or video element and automatically keeps it in sync with the main timeline.

For building components, I used the builder pattern along with the director pattern (see reference).

This way, when building an audio component, I add MediaHook to it, which I also add to video components. However, videos also need additional hooks for:

  • Creating the texture
  • Setting up the sprite
  • Setting the right location in the scene
  • Handling rendering

This approach makes it very easy to change, extend, or modify the rendering logic or how the components behave in the scene.
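Reusing the Component and ComponentHook types from the sketch above, the builder/director pair could look roughly like this (MediaHook is described above; the video-only hooks are hypothetical stand-ins):

declare class MediaHook implements ComponentHook {}
declare class TextureHook implements ComponentHook {}
declare class SpriteHook implements ComponentHook {}

class ComponentBuilder {
  private component = new Component();

  withHook(hook: ComponentHook) {
    this.component.addHook(hook);
    return this;
  }

  build() {
    return this.component;
  }
}

// The director knows which hooks each component type needs.
class ComponentDirector {
  buildAudio() {
    return new ComponentBuilder().withHook(new MediaHook()).build();
  }

  buildVideo() {
    return new ComponentBuilder()
      .withHook(new MediaHook())   // shared with audio: element creation + timeline sync
      .withHook(new TextureHook()) // hypothetical: creates the texture
      .withHook(new SpriteHook())  // hypothetical: sprite setup, placement, rendering
      .build();
  }
}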

Backend and Rendering

I tried multiple approaches to rendering videos in the fastest and most cost-efficient way.

In 2020, I started with the simplest approach - rendering one frame after another, which is something that many tools do.

After some trial-and-error, I switched to a rendering layers approach.

That means our SceneData document contains layers which contain components.

Each of these layers is rendered separately and then combined with ffmpeg to create the final output.

The limitation was that a layer can only contain components of the same type.

For example, a layer with video cannot contain text elements; it can only contain other videos.

This obviously has some pros and cons.

It was quite simple to render HTML texts with animations on Lambda independently and turn them into transparent videos, which were then combined with other chunks for the final output.

On the other hand, layers with video components were simply processed with ffmpeg.

However, this approach had a huge drawback.

If I wanted to implement a keyframe system to scale, fade, or rotate the video, I would need to port these features to fluent-ffmpeg.

That is definitely possible, but with all the other responsibilities I have, I simply never got around to it.

So I decided to go back to the first approach - rendering one frame after another.

Express and BullMQ

Rendering requests are sent to the backend server with Express.

This route checks whether the video is already being rendered, and if not, the job is added to the BullMQ queue.
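A minimal sketch of that route (the queue name, route, and duplicate check here are illustrative, not our actual backend code):

import express from 'express';
import { Queue } from 'bullmq';

const renderQueue = new Queue('video-render', {
  connection: { host: 'localhost', port: 6379 },
});

const app = express();
app.use(express.json());

app.post('/render', async (req, res) => {
  const { videoId } = req.body;

  // Skip if a render job for this video is already waiting or running.
  const pending = await renderQueue.getJobs(['waiting', 'active']);
  if (pending.some((job) => job.data.videoId === videoId)) {
    res.status(409).json({ error: 'Render already in progress' });
    return;
  }

  await renderQueue.add('render', { videoId });
  res.json({ queued: true });
});

app.listen(3000);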

Playwright / Puppeteer

After the queue starts processing the render, it spawns multiple instances of headless Chrome.

Note: this processing happens on a dedicated Hetzner server with an AMD EPYC 7502P 32-core processor and 128 GB of RAM, so it's quite a performant machine.

Keep in mind that Chromium doesn't ship with proprietary video codecs, so I use Playwright, which makes it trivial to install full Chrome instead.

But still, the video frames came out black for some reason.

I'm sure I was just missing something; however, I decided to split the video components into individual image frames and use those in the headless browser instead of the videos themselves.

Either way, the most important part was to avoid the screenshot method.

Since we have everything in one canvas, we can get it into an image with .toDataURL() on the canvas, which is much faster.

To make this simpler, I made a static page that bundles the video editor and exposes some functions on window.

This is then loaded with Playwright/Puppeteer, and on each frame, I simply call:

const frameData = await page.evaluate(`window.setFrame(${frameNumber})`);

This gives me the frame data that I can either save as an image or add into a buffer to render the video chunk.
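Putting it together, a render worker loop could look roughly like this (window.setFrame is the function from the bundled page; the rest of the names and paths are illustrative):

import { chromium } from 'playwright';
import { writeFile } from 'node:fs/promises';

async function renderChunk(pageUrl: string, from: number, to: number) {
  const browser = await chromium.launch({ channel: 'chrome' }); // full Chrome, not Chromium
  const page = await browser.newPage();
  await page.goto(pageUrl, { waitUntil: 'networkidle' });

  for (let frame = from; frame <= to; frame++) {
    // setFrame() seeks the editor to the given frame and returns the canvas data URL.
    const frameData = (await page.evaluate(`window.setFrame(${frame})`)) as string;
    const base64 = frameData.replace(/^data:image\/\w+;base64,/, '');
    await writeFile(`frames/frame-${String(frame).padStart(6, '0')}.png`, base64, 'base64');
  }

  await browser.close();
}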

This whole process is split across 5-10 workers, depending on the video length, and their chunks are then merged into the final output.
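Merging the chunks is then a job for ffmpeg's concat demuxer, something along these lines (a sketch with illustrative paths):

import { execFile } from 'node:child_process';
import { writeFile } from 'node:fs/promises';
import { promisify } from 'node:util';

const run = promisify(execFile);

async function mergeChunks(chunkFiles: string[], output: string) {
  // The concat demuxer reads a list of files and stitches them together.
  const list = chunkFiles.map((file) => `file '${file}'`).join('\n');
  await writeFile('chunks.txt', list);

  await run('ffmpeg', [
    '-f', 'concat',
    '-safe', '0',
    '-i', 'chunks.txt',
    '-c', 'copy', // all chunks share the same codec, so no re-encoding is needed
    output,
  ]);
}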

Alternatively, this could be offloaded to something like Lambda, but I'm leaning towards RunPod. The only drawback of their serverless architecture is that it uses Python, which I'm not that familiar with.

This way, the rendering can be split into multiple chunks that are processed in the cloud, so even a 60-minute video can be rendered in a minute or two. Nice to have, but that's not our primary goal or use case.

What I Did NOT Solve (Yet)

The reason I downgraded from Pixi v8 to Pixi v7 is that Pixi v7 also has a "legacy" build that supports the 2D canvas renderer. This is MUCH faster for rendering: a 60-second video takes around 80 seconds to render on the server, but with a WebGL or WebGPU context I was only able to render 1-2 frames per second.

Interestingly enough, headless Chrome was much slower than headful Firefox when rendering WebGL canvases, according to my testing.

Even using a dedicated GPU didn't help speed up the rendering by any significant margin. Either I was doing something wrong, or simply headless Chrome isn't very performant with WebGL.

WebGL in our use case is great for transitions, which are usually quite short.

One approach I plan to test here is rendering the WebGL and non-WebGL chunks separately.

Other Components

There are many parts involved in the project.

Scene data is stored in MongoDB, since the structure of the documents makes the most sense in a schemaless database.

The frontend, written in SvelteKit, uses urql as a GraphQL client.

The GraphQL server uses PHP Laravel with MongoDB and the amazing Lighthouse GraphQL.

But that's maybe a topic for next time.

Wrapping Up

So that's it for now! There's a lot of work that needs to be done before putting this into production and replacing the current video editor, which is quite buggy and reminds me a bit of Frankenstein.

Let me know what you think and keep on rockin'!
