A multimodal Telegram bot I recently made was a resounding success. I was surprised how many people used it and forked/starred it on GitHub. But I wanted something more.
I decided to create a service where people can create their own comics, fairy tales, and any other stories, ideally at the push of a button.
My idea was a program that generates a story from a small number of parameters: the language, a seed phrase for the text, the visual setting, and so on. I knew I would need GPT-4, some kind of image-generation API, a translator, and a speech synthesizer. A quick check showed that all of this is available and not that expensive!
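In simplified form, the text-generation step looks something like this (a sketch using the OpenAI Node SDK; the prompt wording and parameter names are illustrative, not the production code):

const OpenAI = require("openai");

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Generate a story split into scenes, each with an image description.
async function generateStory({ language, seed, setting }) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "user",
        content:
          `Write a short illustrated story in ${language}, ` +
          `set in a ${setting} world, inspired by: "${seed}". ` +
          `Split it into numbered scenes, each with a one-sentence image description.`,
      },
    ],
  });
  return completion.choices[0].message.content;
}

The scene descriptions in this output are what feed the image step described below.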
The following picture popped into my head:
It took a little less than two weeks to build. But please do not judge too harshly: after all, I am the backend developer, frontend developer, DevOps engineer, designer, product manager, and marketer all in one person. And then some.
Some technical details are described below.
Images
I decided to use good old Stable Diffusion, because it is cheap (it is even open source, though I use the API) and draws pretty well, while MidJourney is still closed.
I generate an image from the description of each scene in the story. I also added various visual styles and settings to make the images more appealing and better matched to the context of the scene: for example, styling in the spirit of Star Wars, Disney, Marvel, and so on, at the user's choice.
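Roughly, the chosen style is just appended to the scene description before the image request. A minimal sketch, assuming the Stability AI REST endpoint (the engine name and payload details may differ from what I actually run):

const STYLES = {
  starwars: "in the style of Star Wars, cinematic lighting",
  disney: "in the style of a classic Disney animation",
  marvel: "in the style of a Marvel comic book",
};

// Request one styled image for a scene; returns the PNG bytes.
async function generateSceneImage(sceneDescription, styleKey) {
  const response = await fetch(
    "https://api.stability.ai/v1/generation/stable-diffusion-xl-1024-v1-0/text-to-image",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.STABILITY_API_KEY}`,
        "Content-Type": "application/json",
        Accept: "application/json",
      },
      body: JSON.stringify({
        text_prompts: [{ text: `${sceneDescription}, ${STYLES[styleKey]}` }],
        width: 1024,
        height: 1024,
        samples: 1,
      }),
    }
  );
  const data = await response.json();
  return Buffer.from(data.artifacts[0].base64, "base64");
}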
As a result, I get a set of images in a consistent style, ready for video generation.
Recently, in one community, someone suggested an almost brilliant idea: instead of generating pictures, search for them on Google Images. It's free, fast, and in some cases (news, for example) even better. I will definitely implement it.
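If I do, it would probably look something like this sketch with the Google Custom Search JSON API (not implemented yet; the key and search-engine ID are placeholders):

// Return the URL of the first image matching the query.
async function findImage(query) {
  const url =
    "https://www.googleapis.com/customsearch/v1" +
    `?key=${process.env.GOOGLE_API_KEY}` +
    `&cx=${process.env.GOOGLE_CSE_ID}` +
    `&q=${encodeURIComponent(query)}` +
    "&searchType=image&num=1";
  const data = await (await fetch(url)).json();
  return data.items?.[0]?.link;
}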
Sound
When I first started working on the project, I ran into a question: how do I let users not only read the stories they create, but also listen to them?
Then it occurred to me to voice the text with Google Text-to-Speech. It can produce realistic narration in different languages and with different voices.
You just need to split the text generated by GPT-4 into paragraphs and send each paragraph for voiceover. That way, users can read the story and listen to the voiced version at the same time. This makes reading more engaging and also helps people who prefer listening to reading.
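A simplified sketch of this step with the official @google-cloud/text-to-speech client (naively concatenating the MP3 buffers is good enough for simple playback; ffmpeg would be more robust):

const fs = require("fs");
const textToSpeech = require("@google-cloud/text-to-speech");

const ttsClient = new textToSpeech.TextToSpeechClient();

// Voice each paragraph separately, then glue the pieces into one track.
async function voiceStory(paragraphs, languageCode) {
  const chunks = [];
  for (const paragraph of paragraphs) {
    const [response] = await ttsClient.synthesizeSpeech({
      input: { text: paragraph },
      voice: { languageCode, ssmlGender: "NEUTRAL" },
      audioConfig: { audioEncoding: "MP3" },
    });
    chunks.push(Buffer.from(response.audioContent));
  }
  fs.writeFileSync("path/to/combined/audio.mp3", Buffer.concat(chunks));
}

The combined file is the same audio track that the video step picks up later.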
Video
The most difficult part was assembling the story into a video with videoshow.js.
To create a video, I describe each slide with an object. The setup looks something like this:
const videoshow = require("videoshow");

// General output settings: frame rate, seconds per image (loop),
// transitions, bitrates, size, and container format.
const videoOptions = {
  fps: 25,
  loop: 5, // seconds each image stays on screen
  transition: true,
  transitionDuration: 1, // seconds
  videoBitrate: 1024,
  videoCodec: "libx264",
  size: "640x?",
  audioBitrate: "128k",
  audioChannels: 2,
  format: "mp4",
  pixelFormat: "yuv420p",
};

// One entry per scene: the generated image and its caption.
const imageDescriptions = [
  { path: "path/to/image1.jpg", caption: "Caption 1" },
  { path: "path/to/image2.jpg", caption: "Caption 2" },
  { path: "path/to/image3.jpg", caption: "Caption 3" },
];

// The single combined voiceover track (see the Sound section).
const audio = "path/to/combined/audio.mp3";
Here I define the video parameters: frame rate, how long each image is shown, video and audio bitrate, output size, and file format. There is also an array of objects, each holding the path to an image and its caption, plus the path to one combined audio file.
Next, I build an array of objects, each representing one slide of the video:
const frames = [];
for (let i = 0; i < imageDescriptions.length; i++) {
  const image = imageDescriptions[i];
  // Each slide: the image path, its caption, and its on-screen duration.
  const frame = {
    path: image.path,
    caption: image.caption,
    loop: 5,
  };
  frames.push(frame);
}
Here I iterate over the images and create a slide object with the path to the image, its caption, and how long it stays on screen.
Finally, the video is assembled using the objects created earlier:
videoshow(frames, videoOptions)
  .audio(audio)
  .save("path/to/output.mp4")
  .on("start", function (command) {
    console.log("ffmpeg process started:", command);
  })
  .on("error", function (err, stdout, stderr) {
    console.error("Error:", err);
  })
  .on("end", function (output) {
    console.log("Video created in:", output);
  });
Debugging all of this took quite a lot of time. Here, for example, is one of the resulting stories:
https://mangatv.shop/api/video/QNPsdmz_9I5GiyZ1p5Jlp.mp4
Globalization
The story generator is not tied to any single language; it is completely global. It works in any language from the Google Text-to-Speech list.
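The list of supported voices can even be pulled programmatically (same client as in the Sound section):

const textToSpeech = require("@google-cloud/text-to-speech");

// Print every available voice and the languages it supports.
async function listSupportedVoices() {
  const client = new textToSpeech.TextToSpeechClient();
  const [result] = await client.listVoices({});
  for (const voice of result.voices) {
    console.log(voice.languageCodes.join(", "), "-", voice.name);
  }
}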
So my plans include launching in the US market: Product Hunt, Y Combinator, and all that. I would be glad of any support in this direction.
Philosophical questions
Finally, the use of AI-generated content raises several philosophical questions. For example, what is the human role in creating and using such content? What are the ethical issues associated with using artificial intelligence to create content that can mimic the human mind and behavior? What is the future of AI-generated content creation and use, and how will this affect our culture and society as a whole? These questions require serious discussion and reflection so that we can make the most of the potential of artificial intelligence in our world.
But I decided to build it first and think about it later.
Will automatically generated content be of sufficient quality?
Today's algorithms can already produce reasonably high-quality text, sound, and images. However, they cannot yet replace human creativity or create something completely new and original.
The story editing feature can help make the content better and more interesting. Editing lets you refine individual slides, correct errors, add new elements, and place emphasis where it belongs. And a human editor can always make a creative contribution.
What do you think? Is the project interesting? Would you use it? What monetization methods would you recommend?
UPD: you can see the project at this link